In my experiment design, the ratio of RNA counts between the WT and KO tissues needs to be maintained during normalization. If I normalize all of my counts together in one big matrix, I lose the difference in intensity between these genotypes, presumably because DESeq2 estimates size factors across the entire matrix. Instead, the normalized counts for both genotypes come out close to equal. In reality, the WT and KO conditions are separate types of experiments and should not have equal normalized counts. If I normalize WT and KO separately with DESeq2 (two separate DESeq2 objects, one for WT and one for KO), the ratio is maintained: the KO (background) is much lower, as I would expect.
However, now I do not see a way to then use the modeling capabilities of DESeq2 since it requires the raw counts in one object, not two separate ones. And upon running the analysis it will estimate size factors across the whole matrix.
Any clarification would be helpful. Thank you.
I know of maybe 5-10 genes that, a priori, would not be expected to be affected based on their characterization by alternative means (i.e. not RNA sequencing) in the literature. Also, is this along the same lines as using spike-ins? I don't think that will be an option for me at this point.
I can't really give much better advice on what to do. Unless you have a really good idea of a set of genes that are definitely not differentially expressed in your system, it's kind of a dangerous way to go. You need to be sure how to define what is constant across the samples. In most experiments, it is enough to rely on the software's default normalization, because we do not expect most or all genes to be greatly shifted. An RNA-seq experiment with global shifts in expression, without any way to figure out what's constant, is a problem.
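To make "define what is constant" concrete, here is a minimal sketch in plain Python of median-of-ratios size factors, computed either from all genes or only from a set of genes assumed to be constant. It is an illustration of the idea, not DESeq2's code; the function name `size_factors`, the `control_rows` parameter, and the toy counts are all made up for this example. In DESeq2 itself, `estimateSizeFactors` accepts a `controlGenes` argument for the same purpose.

```python
import math
from statistics import median

def size_factors(counts, control_rows=None):
    """Median-of-ratios size factors, one per sample (column).

    counts: list of gene rows, each a list of per-sample counts.
    control_rows: optional row indices ASSUMED constant across samples
                  (hypothetical parameter for this sketch).
    """
    rows = range(len(counts)) if control_rows is None else control_rows
    # Use only genes without zeros so the geometric mean is defined.
    usable = [counts[i] for i in rows if all(c > 0 for c in counts[i])]
    if not usable:
        raise ValueError("no usable genes to estimate size factors from")
    n_samples = len(counts[0])
    # Log of each gene's geometric mean across samples.
    log_geo = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy data, columns = [WT, KO]. Rows 0-4 are globally 4x lower in KO;
# rows 5-6 stand in for genes known a priori to be unaffected.
counts = [
    [400, 100],
    [800, 200],
    [200, 50],
    [1200, 300],
    [600, 150],
    [100, 100],
    [300, 300],
]

sf_all = size_factors(counts)                        # all genes
sf_ctrl = size_factors(counts, control_rows=[5, 6])  # control genes only

gene0_all = [c / f for c, f in zip(counts[0], sf_all)]
gene0_ctrl = [c / f for c, f in zip(counts[0], sf_ctrl)]
print("size factors, all genes:", sf_all)
print("size factors, controls: ", sf_ctrl)
print("gene 0 normalized, all genes:", gene0_all)
print("gene 0 normalized, controls: ", gene0_ctrl)
```

With all genes, the global 4x shift is absorbed into the size factors and the normalized WT and KO counts for gene 0 come out about equal. With only the control rows, the size factors are both 1 and the 4x difference survives. The result is only as trustworthy as the assumption that the control genes really are constant, and with just a handful of genes the median is a noisy estimate.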
Also: this is just a WT vs KO comparison, or is there more to the experimental design that you didn't mention?
There is more to my design; you can see it in an earlier question you answered, "DESeq2: Appropriate way to deal with knockouts in experiment design (RIPSeq)". The issue I am having is that hardly anybody does RIP-seq, and it has some unique properties that make it different from RNA-seq, CLIP-seq, or ChIP-seq (as mentioned in the papers for RIPSeeker and Piranha). I keep running into issues and cannot seem to sort them out the way I want (if my approaches are even correct to begin with). Long story short, all my problems essentially stem from the fact that prior to sequencing, my IP RNA concentration is 2-5x higher than the KO RNA concentration (background bead binding control), as expected, but this ratio between IP and KO is lost upon normalization. Therefore, my KO counts are highly inflated, and this ends up affecting DE analysis, peak-calling analysis (using Piranha, for example), or any other approach I might try. A secondary issue is that when I set up my model matrix, it is not full rank, so some columns I need for comparison are omitted by R automatically and I cannot feed them to DESeq2/edgeR.
Ok, I didn't catch from the post above that you were doing RIP-seq. Your experimental design is sufficiently complex that I think it would be good if you found a statistical collaborator. It's possible that you don't need to perform normalization steps (you can substitute 1's for the size factors) if you are only interested in comparing ratios e.g. IP vs control across groups e.g. KO vs WT. But you should check with someone who can go over your design in more detail, because it's difficult for me to parse, and it goes beyond the scope of what I can handle on the support forum.
Sorry, I should have mentioned it was RIP-Seq. Unfortunately, I am having difficulty finding collaborators who know RIP-SEQ but I am seeing a ChIP-SEQ analyst next week who also deals with this IP v. KO question.
Last question: If I treat my IP and KO as separate DESeq2 data sets and normalize each one as I described in my original post, are they comparable? Obviously I cannot do DE analysis with DESeq2 on these normalized counts, but I am wondering if I can still at least get a sense of my "background" by comparing counts in the KO for a particular gene. Thanks for all your help.
I'll just explain what the normalized counts are: if you have a matrix of counts X, the normalized counts are the columns of X, each divided by that sample's size factor, so that the samples with very low and very high sequencing depth have scaled counts in the same range as the samples with middle sequencing depth.
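As a rough illustration of that scaling, the sketch below computes median-of-ratios size factors for a toy matrix and divides each column by its factor. It is plain Python written for this explanation, not DESeq2's implementation; the helper names `estimate_size_factors` and `normalize` and the counts are invented for the example.

```python
import math
from statistics import median

def estimate_size_factors(counts):
    """One size factor per sample (column) by the median-of-ratios idea."""
    # Skip genes with any zero so the geometric mean is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    n_samples = len(counts[0])
    log_geo = [sum(math.log(c) for c in row) / n_samples for row in usable]
    return [
        math.exp(median(math.log(row[j]) - g for row, g in zip(usable, log_geo)))
        for j in range(n_samples)
    ]

def normalize(counts, factors):
    """Divide every column by its sample's size factor."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]

# Toy matrix X: same underlying expression, sequenced at three depths
# (middle, 2x deeper, 2x shallower).
X = [
    [100, 200, 50],
    [40, 80, 20],
    [10, 20, 5],
]

factors = estimate_size_factors(X)
normalized = normalize(X, factors)
print("size factors:", factors)
for row in normalized:
    print(row)
```

Here the size factors come out close to 1, 2 and 0.5, so after division all three columns land in the range of the middle-depth sample. Note that the method cannot tell a depth difference from a genuine global shift: any factor shared by most genes in a sample is removed.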
Whether it makes sense to normalize certain samples together and then combine them is a bit beyond my understanding of the experiment and your goals.