ribosome profiling analysis with different dispersions
Entering edit mode
Jake ▴ 90
Last seen 6 weeks ago
United States


I have been using DESeq2 to analyze ribosome profiling data with the following formula and normalizing each library type (ribosome fragments and mRNA input) using separate size factors as suggested by Michael Love in a previous post.

~ assay + condition + assay:condition

sf <- numeric(ncol(dds))
idx1 <- dds$assay == "assay1"
sf[ idx1 ] <- estimateSizeFactorsForMatrix(counts(dds)[ , idx1])
# repeat for assay2
sizeFactors(dds) <- sf
# continue with DESeq()

I recently came across this paper ( http://biorxiv.org/content/early/2015/04/10/017111 ). They use a similar linear model to analyze ribosome profiling data with two important differences:

1) They calculate the dispersion seperately for ribosome profiling data and input mRNA data.

2) To determine the significance of changes in the interaction term, I believe that they use a different method than DESeq2 uses. They state "In a treatment/control setting, we can then evaluate whether a treatment (C = 1) has a significant differential effect on translation efficiency compared to control (C = 0), which is equivalent to determining whether the inferred parameter β∆,1 differs significantly from 0. This is whether the relationship denoted by the dashed line in Fig. 1A is needed or not. We can compute significance levels based on the χ2 distribution by analyzing log-likelihood ratios of the Null model (βi = 0) and the alternative model (βi =/= 0). "

I was wondering if anyone with more background in the statistics of rna sequence analysis could weigh on with their thoughts on these differences and if one method is better than the other? Does DESeq2 support separate dispersion estimates for comparing different assay types?

deseq2 linear model ribosome profiling stats • 1.8k views
Entering edit mode
Last seen 2.1 years ago
Scripps Research, La Jolla, CA

For dispersion estimation, my understanding is that both edgeR and DESeq2 have made the deliberate choice to only estimate a single dispersion parameter for each gene because most RNA-seq experiments don't even have enough replicates to estimate one dispersion per gene (hence the need for sharing information between genes), let alone 2. So even if treatment groups have different dispersions, having one mediocre dispersion estimate representing the average across all groups is still better than multiple worse dispersion estimates.

For the significance testing, It sounds like they are using the standard likelihood ratio test for GLMs, which is supported by both DESeq2 and edgeR.

If you want to use a Bioconductor package to fit a model that allows different treatment groups to have different variances, you can use voomWithQualityWeights from the limma package.

Entering edit mode

I don't have anything to add to Ryan's answer. I haven't looked in depth at this in particular, but there can certainly be cases where the dispersion (essentially the coefficient of variation) is so different across experiment types that extra modeling is worthwhile, although one has to consider the additional cost in terms of increased estimation variance (less samples for each parameter).

Entering edit mode
Last seen 2.1 years ago
Scripps Research, La Jolla, CA

As another option, you could follow the methodology in this paper to fit a model with separate dispersions for the two assay types:


In that paper, the two "assay types" are the "this exon" and "other exon" counts, but I think the code would work for your use case as well.


Login before adding your answer.

Traffic: 226 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6