I received a question about RIP-Seq by email, and wanted to post a reminder about how to test for ratio of ratios using DESeq2.
Suppose we have two assays: Input and IP, and we have two conditions: A and B.
Then using a design: '~ assay + condition + assay:condition', the interaction term 'assay:condition' represents the ratio of the ratios: (IP for B / Input for B) / (IP for A / Input for A)... which with some algebra you can see is equivalent to (IP for B / IP for A) / (Input for B / Input for A).
Using a likelihood ratio test which removes the interaction term in the reduced model, we test whether the IP enrichment is different in condition B than in A:
dds <- DESeqDataSet(se, design= ~ assay + condition + assay:condition) dds <- DESeq(dds, test="LRT", reduced= ~ assay + condition) results(dds)
(Or similarly, using another DESeqDataSet constructor function).
Note: if there are only 4 samples total (no replicates), then there are not enough residual degrees of freedom to estimate the dispersion, and DESeq2 will automatically use a design of ~ 1 for estimating the dispersion (and will print a message about this). This means that the analysis is underpowered to detect differences, because we are overestimating the dispersion in ignoring the differences due to experimental design. See ?DESeq for more on this case.
Further questions:
what if the depth is lower for the IP samples? This is handled by the size factors in the model.
what if the distribution of counts is different for the IP samples? Suppose first that the counts are still decently modeled with a Negative Binomial. DESeq2 estimates a single dispersion parameter per gene. If the IP samples have a higher dispersion than the Input samples, then the dispersion estimate from all samples will be in-between the dispersion estimate you would get from the assays separately. I would guess that this wouldn't be too problematic, but it's hard to say without seeing the actual data, and will depend on how far apart the dispersions of the two assays are. Secondly, what if the IP counts are be poorly fit with a Negative Binomial? Again, it's hard to say exactly how this would effect the performance without knowing how far off from a Negative Binomial they are.
Hello Michael,
First of all thank you very much for the quick reply!
I have an additional question regarding to the normalization: I know that the normalization corrects for differences between sequencing depth, but what if all samples receive approximately the same total amount of reads, but in the IP samples we only have a small fraction of transcripts represented, yielding much more reads for those transcripts as apposed to the input samples, where all the expressed genes appear in the data. will the normalization still correct for the differences, even though the total amount of reads is identical for all samples?
In addition, after i will receive the data, what is the best way for me to check whether the IP samples can be modeled with the negative binomial distribution? and what is the best way for me to see how far apart are the dispersions of both assays?
Thank you very much,
Olga.
"what if all samples receive approximately the same total amount of reads, but in the IP samples we only have a small fraction of transcripts represented, yielding much more reads for those transcripts as apposed to the input samples, where all the expressed genes appear in the data"
This will be taken care of by the 'assay' term in the design specified above.
"after i will receive the data, what is the best way for me to check whether the IP samples can be modeled with the negative binomial distribution"
If you don't have any biological replicates, then it's hard to make any comparison of how well different
parameterizationsdistributions fit. The Negative Binomial allows us to fit the mean and the variance, so it is fairly flexible. For data with many replicates for each individual condition, extreme bimodality within one condition (half the samples at 0 and half large counts) would be something not handled well by the Negative Binomial. With many replicates, a useful diagnostic is looking at histograms of the normalized counts for a single gene and a single condition. Also if you have many replicates, you can estimate the dispersion within each (i.e. build a dataset with only one condition and ~1) and compare them.Thank you very much for the suggestions.
I would like to raise another question regarding the biological replicates.
As it seems, the intention of the researcher is to sequence 2 biological replicates for the IP samples and sequence only one input replicate for each condition. I know that it is not much and it is better to use more replicates, but will it be OK to work with input samples without replicates at all?
2 vs 1 has lower statistical power than 2 vs 2, but large consistent fold changes should still be detected.
Hello Michael,
Sorry to bother you with this 3.8-year old question again..
I understand what you are saying here: design= ~assay*condition, however, I think I start to get confused when I have biological repeats. Let say if I have 3 biological repeats, and I denote them as Biosample 1/2/3. The sample setup would then be as followed:
> biosample<-rep(c(1,2,3),c(4,4,4))
> Condition<-rep(c(rep("+",2),rep("-",2)),3)
> Assay<-rep(rep(c("IP","Inp"),2),3)
> sample_setup<-cbind(biosample,Treatment,IP)
> sample_setup
biosample Condition Assay
[1,] "1" "+" "IP"
[2,] "1" "+" "Inp"
[3,] "1" "-" "IP"
[4,] "1" "-" "Inp"
[5,] "2" "+" "IP"
[6,] "2" "+" "Inp"
[7,] "2" "-" "IP"
[8,] "2" "-" "Inp"
[9,] "3" "+" "IP"
[10,] "3" "+" "Inp"
[11,] "3" "-" "IP"
[12,] "3" "-" "Inp"
In this case, I interested in knowing the effect of the treatment, how shall I model my experiment then? should it be design=~Assay*Condition + Biosample?
Thank you very much
Yes but it's good to put the nuisance parameters first in the design, because by default many packages assume you want to test the last parameter (of course you can be specific about which one to test, but in case someone later runs your scripts and doesn't know this...).
So I'd do
~biosample + condition + assay + condition:assay
.I also like to write out the above, which is equivalent to
~biosample + condition*assay
, because it helps beginners see that there will be two main effects and an interaction term.Hi Michael,
I have a slight variation of the ribosome profiling/RNA-seq data above so I will just ask my questions here. I'd really appreciate your help!
I have 4 different conditions: 2 strains grown at 2 temperatures, each condition has ribo-seq and RNA-seq data, and two replicates for each. From what I've read so far on this site, it seems like it's more sensible for me to keep the strain and temperature combined as one condition, instead of having them in separate columns? Here is what I have:
condition type
MA25_RPF M25 RPF
MB25_RPF M25 RPF
MA25_RNA M25 RNA
MB25_RNA M25 RNA
MA37_RPF M37 RPF
MB37_RPF M37 RPF
MA37_RNA M37 RNA
MB37_RNA M37 RNA
WA25_RPF W25 RPF
WB25_RPF W25 RPF
WA25_RNA W25 RNA
WB25_RNA W25 RNA
WA37_RPF W37 RPF
WB37_RPF W37 RPF
WA37_RNA W37 RNA
WB37_RNA W37 RNA
I'd like to compare Mutant vs WT for each temperature, and also 25C vs 37C for each strain, after normalization of ribo- to RNA-seq. I set "W25" as the reference. Is this design ok for what I want to do?
If this design makes sense, then I have a follow-up question. From what I understand of the results based on the resultsNames (please correct me if I'm wrong),
"typeRPF.conditionM25"
would mean comparing RPF/RNA of M25 to RPF/RNA of W25. But for"condition_M25_vs_W25"
, it does not include the interaction term and is only comparing the RPF. If I want to compare the RPF/RNA of one sample to another sample that is not the reference, I would need to uselist()
incontrast
, is that right? I'm not sure which of these terms I should use to go intolist()
for, say, if I want to compare RPF/RNA of M37 to RPF/RNA of W37?Thank you in advance for your help!
Kotcha
Would you mind posting this as a new question just so it’s easier to follow the thread? It’s worth it’s own post I think.
Hi Michael,
Thank you for your reply. I actually found another post that you have answered (DESeq2 for Ribosomal profiling analysis [two-factor designs]) that I think helped answer my questions. I will repost this question anyway though, along with the update of what I changed after reading the said post, because I am still not sure if I am using DESeq2 correctly.
Kotcha
Hey Micheal,
I was having problems setting colData for this type of analysis. My analysis is not RipSeq, but theoretically something very similar to RipSeq. I have two conditions "A" & "B". Each condition has a "fraction" samples and a "total" sample. I have 5 replicates for each condition. The comparison that I want to make is "factionA/totalA" vs "fractionB/totalB", which essentially looks like a RipSeq analysis. The other catch here is that, the "total" sample for both conditions are the same. So "totalA" & "totalB" are the same samples. So, in all I have 15 samples : 5 "fractionA", 5 "fractionB" & 5 "total" samples which are common for conditions A & B.
The colData that I currently have is as follows:
Samples condition assay
F1 A fraction
F2 A fraction
F3 A fraction
F4 A fraction
F5 A fraction
T1 A T
T2 A T
T3 A T
T4 A T
T5 A T
F6 B fraction
F7 B fraction
F8 B fraction
F9 B fraction
F10 B fraction
T6 B T
T7 B T
T8 B T
T9 B T
T10 B T
T1-T5 are the same as T6-T10. I added them as separate to keep the order of the analysis. I keep getting the following error:
"the model matrix is not full rank, i.e. one or more variables in the design formula are linear combinations of the others"
which I assume is because of the incorrect order of the colData. I have tried a couple of other ways to set up the colData, but to no use. Any help on this would be greatly appreciated.
Thanks!
Shashank
hi Shashank,
You would use a different design if you only have 3 groups. Note that:
(fraction A / total) / (fraction B / total) = fraction A / fraction B.
Because the total samples are the same for both A and B, these fall out of the equation.
You should combine condition and assay into a new variable:
dds$group <- factor(paste0(dds$condition, dds$assay))
Then use a design of ~group.
Then you can investigate three contrasts: fraction A vs total, fraction B vs total, or fraction B vs fraction A.
Hey Micheal,
thanks for redirecting me to this post.
I tried to apply DESeq2 in this way to my polysome profiling data. However I am running into an issue and a question.
First of all the issue, it has to do with extracting the contrast of interest. Since it did not work with my data I tried to run the examples given when I write ?results in R, specifically shown here example 2 of the results help section.
Do you have an idea what might cause this error? A search on the internet regarding this error in DESeq2 did not bring me any hints. (session info posted below)
Regarding my question, I use the following setup
This should give me the contrast as from the example you of shashank fractionA/fractionB, if I am not mistaken... is this right?
Currently this gives me the same error as stated in my Issue.
Thanks in advance,
Christian
sessioninfo:
Something is funny here because DESeq2 is tested against that example. If DESeq2 threw an error with that example, it wouldn't be available on the Bioconductor page.
Can you try the example again in a fresh R session?
Your package versions seem fine, but can you try:
I tried to restart the R session and it worked. There were phantom packages, which showed up in the sessioninfo() but were not actually loaded... after detaching them all worked fine.
Do I apply the right lists in the contrast to get the contrast for fractionA/fractionB or in my case Polysome/Cytosolic data? (see previous post second section)
Thanks for the help!
Cheers,
Christian
That is what I was trying but I had the exact same problem with the "Model matrix not full rank". I found an issue in my colData and now it worked as far as I can tell.
Thanks for the help and patience!
Hi Michael, In your code, you used results(dds). Is it correct or necessary to use lfcShrink() to shrink the LFC after the "LRT" test? Does lfcShrink() only affect lowly expressed genes but have very little effects on genes with high expression?
You can use lfcShrink following results to produce an effect size for ranking or visualization. It's not necessary, and yes, if you see the apeglm paper (where we discuss and compare shrinkage estimation from different packages in depth) you see it primarily affects lowly expressed or noisy genes.
Hi Michael,
Thank you for the answers above. What might you suggest for a RIP-seq experiment with three assays - IP, input and IgG non-specific antibody, and two conditions - A and B?
Is it appropriate to do the likelihood ratio test for IP/input, and the likelihood ratio test for IgG/input, then compare them in some way?
Thanks in advance for your help,
Jo
For working with a 3 x 2 design I'd recommend collaborating with a statistician to ensure you are setting up the statistical analysis and interpreting results as planned.