Question: DESeq2 testing ratio of ratios (RIP-Seq, CLIP-Seq, ribosomal profiling)
19
5.1 years ago by
Michael Love25k
United States
Michael Love25k wrote:

I received a question about RIP-Seq by email, and wanted to post a reminder about how to test for ratio of ratios using DESeq2.

Suppose we have two assays: Input and IP, and we have two conditions: A and B.

Then using a design: '~ assay + condition + assay:condition', the interaction term 'assay:condition' represents the ratio of the ratios: (IP for B / Input for B) / (IP for A / Input for A)... which with some algebra you can see is equivalent to (IP for B / IP for A) / (Input for B / Input for A).

Using a likelihood ratio test which removes the interaction term in the reduced model, we test whether the IP enrichment is different in condition B than in A:

dds <- DESeqDataSet(se, design= ~ assay + condition + assay:condition)
dds <- DESeq(dds, test="LRT", reduced= ~ assay + condition)
results(dds)


(Or similarly, using another DESeqDataSet constructor function).

Note: if there are only 4 samples total (no replicates), then there are not enough residual degrees of freedom to estimate the dispersion, and DESeq2 will automatically use a design of ~ 1 for estimating the dispersion (and will print a message about this). This means that the analysis is underpowered to detect differences, because we are overestimating the dispersion in ignoring the differences due to experimental design. See ?DESeq for more on this case.

Further questions:

what if the depth is lower for the IP samples? This is handled by the size factors in the model.

what if the distribution of counts is different for the IP samples? Suppose first that the counts are still decently modeled with a Negative Binomial. DESeq2 estimates a single dispersion parameter per gene. If the IP samples have a higher dispersion than the Input samples, then the dispersion estimate from all samples will be in-between the dispersion estimate you would get from the assays separately. I would guess that this wouldn't be too problematic, but it's hard to say without seeing the actual data, and will depend on how far apart the dispersions of the two assays are. Secondly, what if the IP counts are be poorly fit with a Negative Binomial? Again, it's hard to say exactly how this would effect the performance without knowing how far off from a Negative Binomial they are.

deseq2 • 6.3k views
modified 3.6 years ago • written 5.1 years ago by Michael Love25k
1

Hello Michael,

First of all thank you very much for the quick reply!

I have an additional question regarding to the normalization: I know that the normalization corrects for differences between sequencing depth, but what if all samples receive approximately the same total amount of reads, but in the IP samples we only have a small fraction of transcripts represented, yielding much more reads for those transcripts as apposed to the input samples, where all the expressed genes appear in the data. will the normalization still correct for the differences, even though the total amount of reads is identical for all samples?

In addition, after i will receive the data, what is the best way for me to check whether the IP samples can be modeled with the negative binomial distribution? and what is the best way for me to see how far apart are the dispersions of both assays?

Thank you very much,

Olga.

"what if all samples receive approximately the same total amount of reads, but in the IP samples we only have a small fraction of transcripts represented, yielding much more reads for those transcripts as apposed to the input samples, where all the expressed genes appear in the data"

This will be taken care of by the 'assay' term in the design specified above.

"after i will receive the data, what is the best way for me to check whether the IP samples can be modeled with the negative binomial distribution"

If you don't have any biological replicates, then it's hard to make any comparison of how well different parameterizations distributions fit. The Negative Binomial allows us to fit the mean and the variance, so it is fairly flexible. For data with many replicates for each individual condition, extreme bimodality within one condition (half the samples at 0 and half large counts) would be something not handled well by the Negative Binomial. With many replicates, a useful diagnostic is looking at histograms of the normalized counts for a single gene and a single condition. Also if you have many replicates, you can estimate the dispersion within each (i.e. build a dataset with only one condition and ~1) and compare them.

Thank you very much for the suggestions.

I would like to raise another question regarding the biological replicates.

As it seems, the intention of the researcher is to sequence 2 biological replicates for the IP samples and sequence only one input replicate for each condition. I know that it is not much and it is better to use more replicates, but will it be OK to work with input samples without replicates at all?

2 vs 1 has lower statistical power than 2 vs 2, but large consistent fold changes should still be detected.

Hey Micheal,

I was having problems setting colData for this type of analysis. My analysis is not RipSeq, but theoretically something very similar to RipSeq. I have two conditions "A" & "B". Each condition has a "fraction" samples and a "total" sample. I have 5 replicates for each condition. The comparison that I want to make is "factionA/totalA" vs "fractionB/totalB", which essentially looks like a RipSeq analysis. The other catch here is that, the "total" sample for both conditions are the same. So "totalA" & "totalB" are the same samples. So, in all I have 15 samples : 5 "fractionA", 5 "fractionB" & 5 "total" samples which are common for conditions A & B.

The colData that I currently have is as follows:

Samples condition       assay
F1       A    fraction
F2       A    fraction
F3       A    fraction
F4       A    fraction
F5       A    fraction
T1       A   T
T2       A    T
T3       A    T
T4       A    T
T5       A    T

F6       B     fraction
F7       B     fraction
F8       B     fraction
F9       B    fraction
F10     B     fraction

T6       B    T
T7       B     T
T8       B     T
T9       B     T
T10     B     T

T1-T5 are the same as T6-T10. I added them as separate to keep the order of the analysis. I keep getting the following error:

"the model matrix is not full rank, i.e. one or more variables in the design formula are linear combinations of the others"

which I assume is because of the incorrect order of the colData. I have tried a couple of other ways to set up the colData, but to no use. Any help on this would be greatly appreciated.

Thanks!

Shashank

hi Shashank,

You would use a different design if you only have 3 groups. Note that:

(fraction A / total) / (fraction B / total) = fraction A / fraction B.

Because the total samples are the same for both A and B, these fall out of the equation.

You should combine condition and assay into a new variable:

dds$group <- factor(paste0(dds$condition, dds$assay)) Then use a design of ~group. Then you can investigate three contrasts: fraction A vs total, fraction B vs total, or fraction B vs fraction A. ADD REPLYlink written 4.2 years ago by Michael Love25k Hey Micheal, thanks for redirecting me to this post. I tried to apply DESeq2 in this way to my polysome profiling data. However I am running into an issue and a question. First of all the issue, it has to do with extracting the contrast of interest. Since it did not work with my data I tried to run the examples given when I write ?results in R, specifically shown here example 2 of the results help section. dds <- makeExampleDESeqDataSet(n=100,m=12) dds$genotype <- factor(rep(rep(c("I","II"),each=3),2))
design(dds) <- ~ genotype + condition + genotype:condition
dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
resultsNames(dds)
[1] "Intercept"             "genotype_II_vs_I"      "condition_B_vs_A"
[4] "genotypeII.conditionB"
res <- results(dds)
res
Error in .classEnv(classDef) :
trying to get slot "package" from an object of a basic class ("NULL") with no slots

Do you have an idea what might cause this error? A search on the internet regarding this error in DESeq2 did not bring me any hints. (session info posted below)

Regarding my question, I use the following setup

  colTrans <- matrix(nrow=16,ncol=3)
colnames(colTrans) <- c("samples","condition","RNA")
colTrans[,"samples"] <- rep(c(1,2,3,4),4)
colTrans[,"condition"]<- c(rep("controlPoly",4),rep("treatmentPoly",4),rep("controlCyto",4),rep("treatmentCyto",4))
colTrans[,"RNA"] <- c(rep("P",8),rep("C",8))
colTrans <- as.data.frame(colTrans)
colTrans$group <- paste(colTrans$condition,colTrans$RNA) designTrans <- (~group) colTrans samples condition RNA group 1 1 controlPoly P controlPoly P 2 2 controlPoly P controlPoly P 3 3 controlPoly P controlPoly P 4 4 controlPoly P controlPoly P 5 1 treatmentPoly P treatmentPoly P 6 2 treatmentPoly P treatmentPoly P 7 3 treatmentPoly P treatmentPoly P 8 4 treatmentPoly P treatmentPoly P 9 1 controlCyto C controlCyto C 10 2 controlCyto C controlCyto C 11 3 controlCyto C controlCyto C 12 4 controlCyto C controlCyto C 13 1 treatmentCyto C treatmentCyto C 14 2 treatmentCyto C treatmentCyto C 15 3 treatmentCyto C treatmentCyto C 16 4 treatmentCyto C treatmentCyto C tde <- DESeqDataSetFromMatrix(countData=simCountsObj$combined,colData=colTrans,design=designTrans)
tde <- DESeq(tde,full=designTrans,test="Wald")

resultsNames(tde)
[1] "Intercept"            "groupcontrolCyto.C"   "groupcontrolPoly.P"
[4] "grouptreatmentCyto.C" "grouptreatmentPoly.P"
translation <- results(tde, contrast=list( c("groupcontrolPoly.P","grouptreatmentPoly.P"),c("groupcontrolCyto.C","grouptreatmentCyto.C")))

This should give me the contrast as from the example you of shashank fractionA/fractionB, if I am not mistaken... is this right?

Currently this gives me the same error as stated in my Issue.

Christian

sessioninfo:

sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods
[9] base

other attached packages:
[4] Rcpp_0.12.4                SummarizedExperiment_1.0.2 Biobase_2.30.0
[7] GenomicRanges_1.22.4       GenomeInfoDb_1.6.3         IRanges_2.4.8
[10] S4Vectors_0.8.11           BiocGenerics_0.16.1        ROCR_1.0-7
[13] gplots_3.0.1

loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-2   futile.logger_1.4.1  plyr_1.8.3
[4] XVector_0.10.0       bitops_1.0-6         futile.options_1.0.0
[7] tools_3.2.3          zlibbioc_1.16.0      rpart_4.1-10
[10] RSQLite_1.0.0        annotate_1.48.0      gtable_0.2.0
[13] lattice_0.20-33      DBI_0.3.1            gridExtra_2.2.1
[16] genefilter_1.52.1    cluster_2.0.3        gtools_3.5.0
[19] caTools_1.17.1       locfit_1.5-9.1       grid_3.2.3
[22] nnet_7.3-12          AnnotationDbi_1.32.3 XML_3.98-1.4
[25] survival_2.39-2      BiocParallel_1.4.3   foreign_0.8-66
[28] gdata_2.17.0         latticeExtra_0.6-28  Formula_1.2-1
[31] geneplotter_1.48.0   ggplot2_2.1.0        lambda.r_1.1.7
[34] Hmisc_3.17-3         scales_0.4.0         splines_3.2.3
[37] xtable_1.8-2         colorspace_1.2-6     KernSmooth_2.23-15
[40] acepack_1.3-3.3      munsell_0.4.3      

Something is funny here because DESeq2 is tested against that example. If DESeq2 threw an error with that example, it wouldn't be available on the Bioconductor page.

Can you try the example again in a fresh R session?

Your package versions seem fine, but can you try:

library(BiocInstaller)
biocValid()

I tried to restart the R session and it worked. There were phantom packages, which showed up in the sessioninfo() but were not actually loaded... after detaching them all worked fine.

Do I apply the right lists in the contrast to get the contrast for fractionA/fractionB or in my case Polysome/Cytosolic data? (see previous post second section)

Thanks for the help!
Cheers,

Christian

No, your contrast is not correct. Why don't you follow the example above?

That is what I was trying but I had the exact same problem with the "Model matrix not full rank". I found an issue in my colData and now it worked as far as I can tell.

Thanks for the help and patience!

Hello Michael,

Sorry to bother you with this 3.8-year old question again..

I understand what you are saying here: design= ~assay*condition, however, I think I start to get confused when I have biological repeats. Let say if I have 3 biological repeats, and I denote them as Biosample 1/2/3. The sample setup would then be as followed:

> biosample<-rep(c(1,2,3),c(4,4,4)) > Condition<-rep(c(rep("+",2),rep("-",2)),3) > Assay<-rep(rep(c("IP","Inp"),2),3) > sample_setup<-cbind(biosample,Treatment,IP) > sample_setup       biosample Condition Assay     [1,] "1"       "+"       "IP"   [2,] "1"       "+"       "Inp"  [3,] "1"       "-"       "IP"   [4,] "1"       "-"       "Inp"  [5,] "2"       "+"       "IP"   [6,] "2"       "+"       "Inp"  [7,] "2"       "-"       "IP"   [8,] "2"       "-"       "Inp"  [9,] "3"       "+"       "IP"  [10,] "3"       "+"       "Inp" [11,] "3"       "-"       "IP"  [12,] "3"       "-"       "Inp"

In this case, I interested in knowing the effect of the treatment, how shall I model my experiment then? should it be design=~Assay*Condition + Biosample?

Thank you very much

Yes but it's good to put the nuisance parameters first in the design, because by default many packages assume you want to test the last parameter (of course you can be specific about which one to test, but in case someone later runs your scripts and doesn't know this...).

So I'd do ~biosample + condition + assay + condition:assay.

I also like to write out the above, which is equivalent to ~biosample + condition*assay, because it helps beginners see that there will be two main effects and an interaction term.

Hi Michael,

I have a slight variation of the ribosome profiling/RNA-seq data above so I will just ask my questions here. I'd really appreciate your help!

I have 4 different conditions: 2 strains grown at 2 temperatures, each condition has ribo-seq and RNA-seq data, and two replicates for each. From what I've read so far on this site, it seems like it's more sensible for me to keep the strain and temperature combined as one condition, instead of having them in separate columns? Here is what I have:

> sample_type

         condition type MA25_RPF       M25  RPF MB25_RPF       M25  RPF MA25_RNA       M25  RNA MB25_RNA       M25  RNA MA37_RPF       M37  RPF MB37_RPF       M37  RPF MA37_RNA       M37  RNA MB37_RNA       M37  RNA WA25_RPF       W25  RPF WB25_RPF       W25  RPF WA25_RNA       W25  RNA WB25_RNA       W25  RNA WA37_RPF       W37  RPF WB37_RPF       W37  RPF WA37_RNA       W37  RNA WB37_RNA       W37  RNA

> all.ds <- DESeqDataSetFromMatrix(countData = count_table,
colData = sample_type,
design = ~type + condition + type:condition)​

> all.DGE <- DESeq(all.ds)​

I'd like to compare Mutant vs WT for each temperature, and also 25C vs 37C for each strain, after normalization of ribo- to RNA-seq. I set "W25" as the reference. Is this design ok for what I want to do?

If this design makes sense, then I have a follow-up question. From what I understand of the results based on the resultsNames (please correct me if I'm wrong), "typeRPF.conditionM25" would mean comparing RPF/RNA of M25 to RPF/RNA of W25. But for "condition_M25_vs_W25", it does not include the interaction term and is only comparing the RPF. If I want to compare the RPF/RNA of one sample to another sample that is not the reference, I would need to use list() in contrast, is that right? I'm not sure which of these terms I should use to go into list() for, say, if I want to compare RPF/RNA of M37 to RPF/RNA of W37?

> resultsNames(all.DGE)
[1] "Intercept"            "type_RPF_vs_RNA"      "condition_W37_vs_W25" "condition_M25_vs_W25" "condition_M37_vs_W25"
[6] "typeRPF.conditionW37" "typeRPF.conditionM25" "typeRPF.conditionM37"

Kotcha

Would you mind posting this as a new question just so it’s easier to follow the thread? It’s worth it’s own post I think.

Hi Michael,

Thank you for your reply. I actually found another post that you have answered (DESeq2 for Ribosomal profiling analysis [two-factor designs]) that I think helped answer my questions. I will repost this question anyway though, along with the update of what I changed after reading the said post, because I am still not sure if I am using DESeq2 correctly.

Kotcha