Search
Question: DESeq2 testing ratio of ratios (RIP-Seq, CLIP-Seq, ribosomal profiling)
14
3.7 years ago by
Michael Love17k
United States
Michael Love17k wrote:

I received a question about RIP-Seq by email, and wanted to post a reminder about how to test for ratio of ratios using DESeq2.

Suppose we have two assays: Input and IP, and we have two conditions: A and B.

Then using a design: '~ assay + condition + assay:condition', the interaction term 'assay:condition' represents the ratio of the ratios: (IP for B / Input for B) / (IP for A / Input for A)... which with some algebra you can see is equivalent to (IP for B / IP for A) / (Input for B / Input for A).

Using a likelihood ratio test which removes the interaction term in the reduced model, we test whether the IP enrichment is different in condition B than in A:

dds <- DESeqDataSet(se, design= ~ assay + condition + assay:condition)
dds <- DESeq(dds, test="LRT", reduced= ~ assay + condition)
results(dds)


(Or similarly, using another DESeqDataSet constructor function).

Note: if there are only 4 samples total (no replicates), then there are not enough residual degrees of freedom to estimate the dispersion, and DESeq2 will automatically use a design of ~ 1 for estimating the dispersion (and will print a message about this). This means that the analysis is underpowered to detect differences, because we are overestimating the dispersion in ignoring the differences due to experimental design. See ?DESeq for more on this case.

Further questions:

what if the depth is lower for the IP samples? This is handled by the size factors in the model.

what if the distribution of counts is different for the IP samples? Suppose first that the counts are still decently modeled with a Negative Binomial. DESeq2 estimates a single dispersion parameter per gene. If the IP samples have a higher dispersion than the Input samples, then the dispersion estimate from all samples will be in-between the dispersion estimate you would get from the assays separately. I would guess that this wouldn't be too problematic, but it's hard to say without seeing the actual data, and will depend on how far apart the dispersions of the two assays are. Secondly, what if the IP counts are be poorly fit with a Negative Binomial? Again, it's hard to say exactly how this would effect the performance without knowing how far off from a Negative Binomial they are.

modified 2.2 years ago • written 3.7 years ago by Michael Love17k
1

Hello Michael,

First of all thank you very much for the quick reply!

I have an additional question regarding to the normalization: I know that the normalization corrects for differences between sequencing depth, but what if all samples receive approximately the same total amount of reads, but in the IP samples we only have a small fraction of transcripts represented, yielding much more reads for those transcripts as apposed to the input samples, where all the expressed genes appear in the data. will the normalization still correct for the differences, even though the total amount of reads is identical for all samples?

In addition, after i will receive the data, what is the best way for me to check whether the IP samples can be modeled with the negative binomial distribution? and what is the best way for me to see how far apart are the dispersions of both assays?

Thank you very much,

Olga.

"what if all samples receive approximately the same total amount of reads, but in the IP samples we only have a small fraction of transcripts represented, yielding much more reads for those transcripts as apposed to the input samples, where all the expressed genes appear in the data"

This will be taken care of by the 'assay' term in the design specified above.

"after i will receive the data, what is the best way for me to check whether the IP samples can be modeled with the negative binomial distribution"

If you don't have any biological replicates, then it's hard to make any comparison of how well different parameterizations distributions fit. The Negative Binomial allows us to fit the mean and the variance, so it is fairly flexible. For data with many replicates for each individual condition, extreme bimodality within one condition (half the samples at 0 and half large counts) would be something not handled well by the Negative Binomial. With many replicates, a useful diagnostic is looking at histograms of the normalized counts for a single gene and a single condition. Also if you have many replicates, you can estimate the dispersion within each (i.e. build a dataset with only one condition and ~1) and compare them.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Michael Love17k

Thank you very much for the suggestions.

I would like to raise another question regarding the biological replicates.

As it seems, the intention of the researcher is to sequence 2 biological replicates for the IP samples and sequence only one input replicate for each condition. I know that it is not much and it is better to use more replicates, but will it be OK to work with input samples without replicates at all?

2 vs 1 has lower statistical power than 2 vs 2, but large consistent fold changes should still be detected.

Hey Micheal,

I was having problems setting colData for this type of analysis. My analysis is not RipSeq, but theoretically something very similar to RipSeq. I have two conditions "A" & "B". Each condition has a "fraction" samples and a "total" sample. I have 5 replicates for each condition. The comparison that I want to make is "factionA/totalA" vs "fractionB/totalB", which essentially looks like a RipSeq analysis. The other catch here is that, the "total" sample for both conditions are the same. So "totalA" & "totalB" are the same samples. So, in all I have 15 samples : 5 "fractionA", 5 "fractionB" & 5 "total" samples which are common for conditions A & B.

The colData that I currently have is as follows:

Samples condition       assay
F1       A    fraction
F2       A    fraction
F3       A    fraction
F4       A    fraction
F5       A    fraction
T1       A   T
T2       A    T
T3       A    T
T4       A    T
T5       A    T

F6       B     fraction
F7       B     fraction
F8       B     fraction
F9       B    fraction
F10     B     fraction

T6       B    T
T7       B     T
T8       B     T
T9       B     T
T10     B     T

T1-T5 are the same as T6-T10. I added them as separate to keep the order of the analysis. I keep getting the following error:

"the model matrix is not full rank, i.e. one or more variables in the design formula are linear combinations of the others"

which I assume is because of the incorrect order of the colData. I have tried a couple of other ways to set up the colData, but to no use. Any help on this would be greatly appreciated.

Thanks!

Shashank

hi Shashank,

You would use a different design if you only have 3 groups. Note that:

(fraction A / total) / (fraction B / total) = fraction A / fraction B.

Because the total samples are the same for both A and B, these fall out of the equation.

You should combine condition and assay into a new variable:

dds$group <- factor(paste0(dds$condition, dds$assay)) Then use a design of ~group. Then you can investigate three contrasts: fraction A vs total, fraction B vs total, or fraction B vs fraction A. ADD REPLYlink written 2.8 years ago by Michael Love17k Hey Micheal, thanks for redirecting me to this post. I tried to apply DESeq2 in this way to my polysome profiling data. However I am running into an issue and a question. First of all the issue, it has to do with extracting the contrast of interest. Since it did not work with my data I tried to run the examples given when I write ?results in R, specifically shown here example 2 of the results help section. dds <- makeExampleDESeqDataSet(n=100,m=12) dds$genotype <- factor(rep(rep(c("I","II"),each=3),2))
design(dds) <- ~ genotype + condition + genotype:condition
dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
resultsNames(dds)
[1] "Intercept"             "genotype_II_vs_I"      "condition_B_vs_A"
[4] "genotypeII.conditionB"
res <- results(dds)
res
Error in .classEnv(classDef) :
trying to get slot "package" from an object of a basic class ("NULL") with no slots

Do you have an idea what might cause this error? A search on the internet regarding this error in DESeq2 did not bring me any hints. (session info posted below)

Regarding my question, I use the following setup

  colTrans <- matrix(nrow=16,ncol=3)
colnames(colTrans) <- c("samples","condition","RNA")
colTrans[,"samples"] <- rep(c(1,2,3,4),4)
colTrans[,"condition"]<- c(rep("controlPoly",4),rep("treatmentPoly",4),rep("controlCyto",4),rep("treatmentCyto",4))
colTrans[,"RNA"] <- c(rep("P",8),rep("C",8))
colTrans <- as.data.frame(colTrans)
colTrans$group <- paste(colTrans$condition,colTrans$RNA) designTrans <- (~group) colTrans samples condition RNA group 1 1 controlPoly P controlPoly P 2 2 controlPoly P controlPoly P 3 3 controlPoly P controlPoly P 4 4 controlPoly P controlPoly P 5 1 treatmentPoly P treatmentPoly P 6 2 treatmentPoly P treatmentPoly P 7 3 treatmentPoly P treatmentPoly P 8 4 treatmentPoly P treatmentPoly P 9 1 controlCyto C controlCyto C 10 2 controlCyto C controlCyto C 11 3 controlCyto C controlCyto C 12 4 controlCyto C controlCyto C 13 1 treatmentCyto C treatmentCyto C 14 2 treatmentCyto C treatmentCyto C 15 3 treatmentCyto C treatmentCyto C 16 4 treatmentCyto C treatmentCyto C tde <- DESeqDataSetFromMatrix(countData=simCountsObj$combined,colData=colTrans,design=designTrans)
tde <- DESeq(tde,full=designTrans,test="Wald")

resultsNames(tde)
[1] "Intercept"            "groupcontrolCyto.C"   "groupcontrolPoly.P"
[4] "grouptreatmentCyto.C" "grouptreatmentPoly.P"
translation <- results(tde, contrast=list( c("groupcontrolPoly.P","grouptreatmentPoly.P"),c("groupcontrolCyto.C","grouptreatmentCyto.C")))

This should give me the contrast as from the example you of shashank fractionA/fractionB, if I am not mistaken... is this right?

Currently this gives me the same error as stated in my Issue.

Christian

sessioninfo:

sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods
[9] base

other attached packages:
[1] BiocInstaller_1.20.1       DESeq2_1.10.1              RcppArmadillo_0.6.700.3.0
[4] Rcpp_0.12.4                SummarizedExperiment_1.0.2 Biobase_2.30.0
[7] GenomicRanges_1.22.4       GenomeInfoDb_1.6.3         IRanges_2.4.8
[10] S4Vectors_0.8.11           BiocGenerics_0.16.1        ROCR_1.0-7
[13] gplots_3.0.1

loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-2   futile.logger_1.4.1  plyr_1.8.3
[4] XVector_0.10.0       bitops_1.0-6         futile.options_1.0.0
[7] tools_3.2.3          zlibbioc_1.16.0      rpart_4.1-10
[10] RSQLite_1.0.0        annotate_1.48.0      gtable_0.2.0
[13] lattice_0.20-33      DBI_0.3.1            gridExtra_2.2.1
[16] genefilter_1.52.1    cluster_2.0.3        gtools_3.5.0
[19] caTools_1.17.1       locfit_1.5-9.1       grid_3.2.3
[22] nnet_7.3-12          AnnotationDbi_1.32.3 XML_3.98-1.4
[25] survival_2.39-2      BiocParallel_1.4.3   foreign_0.8-66
[28] gdata_2.17.0         latticeExtra_0.6-28  Formula_1.2-1
[31] geneplotter_1.48.0   ggplot2_2.1.0        lambda.r_1.1.7
[34] Hmisc_3.17-3         scales_0.4.0         splines_3.2.3
[37] xtable_1.8-2         colorspace_1.2-6     KernSmooth_2.23-15
[40] acepack_1.3-3.3      munsell_0.4.3      

Something is funny here because DESeq2 is tested against that example. If DESeq2 threw an error with that example, it wouldn't be available on the Bioconductor page.

Can you try the example again in a fresh R session?

Your package versions seem fine, but can you try:

library(BiocInstaller)
biocValid()

I tried to restart the R session and it worked. There were phantom packages, which showed up in the sessioninfo() but were not actually loaded... after detaching them all worked fine.

Do I apply the right lists in the contrast to get the contrast for fractionA/fractionB or in my case Polysome/Cytosolic data? (see previous post second section)

Thanks for the help!
Cheers,

Christian

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by c.oertlin0
No, your contrast is not correct. Why don't you follow the example above?

That is what I was trying but I had the exact same problem with the "Model matrix not full rank". I found an issue in my colData and now it worked as far as I can tell.

Thanks for the help and patience!