4 months ago by
Cambridge, United Kingdom
Hm. The concepts seem pretty obvious to me, but then again I suppose they would be.
- Common dispersion is the mean dispersion across all genes.
- Trended dispersion is the mean dispersion across all genes with similar abundance. In other words, the fitted value of the mean-dispersion trend.
I suppose I should qualify this by saying that we don't literally compute the mean in edgeR, but rather take the dispersion value that maximizes the adjusted profile likelihood. This is unlikely to be helpful for your audience, so you can probably just simplify it by saying that it's the mean.
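If it helps to see where these values live, here is a minimal sketch on simulated counts. The simulation settings (1000 genes, two groups of three replicates, a true dispersion of 0.1) are made up purely for illustration and are not from any real dataset.

```r
library(edgeR)

# Simulated counts, for illustration only: 1000 genes, 2 groups x 3 replicates.
set.seed(1)
ngenes <- 1000
group <- factor(rep(c("A", "B"), each = 3))
mu <- 2^runif(ngenes, 3, 10)   # assumed gene-wise mean counts
# mu is recycled down the columns, so row i always uses mu[i].
counts <- matrix(rnbinom(ngenes * 6, mu = mu, size = 1 / 0.1), nrow = ngenes)

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
design <- model.matrix(~group)
y <- estimateDisp(y, design)

y$common.dispersion          # a single value for the whole dataset
head(y$trended.dispersion)   # one value per gene, read off the abundance trend
head(y$tagwise.dispersion)   # one value per gene, shrunk towards the trend
```

The common dispersion is a single number, while the trended and tagwise dispersions have one entry per gene.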
For the tagwise dispersions, perhaps the simplest way to proceed would be to make a BCV plot from the output of estimateDisp with prior.df=0. This will yield dispersion estimates without any empirical Bayes shrinkage, i.e., "raw" tagwise estimates that do not share information across genes. You can then compare this to the BCV plot with the default prior.df; you should see that the points are squeezed towards the trend (or towards the common value in red, if you set trend="none"). This should demonstrate how empirical Bayes shrinkage works, effectively squeezing values together to reduce the effect of estimation uncertainty when you have low numbers of replicates.
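The comparison could be sketched roughly as follows, again on simulated counts that are invented for illustration. The object names y.raw and y.shrunk are just placeholders, and how much squeezing you see will depend on your own data.

```r
library(edgeR)

# Simulated counts, for illustration only: 1000 genes, 2 groups x 3 replicates.
set.seed(2)
ngenes <- 1000
group <- factor(rep(c("A", "B"), each = 3))
mu <- 2^runif(ngenes, 3, 10)   # assumed gene-wise mean counts
counts <- matrix(rnbinom(ngenes * 6, mu = mu, size = 1 / 0.1), nrow = ngenes)

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
design <- model.matrix(~group)

y.raw <- estimateDisp(y, design, prior.df = 0)   # no empirical Bayes shrinkage
y.shrunk <- estimateDisp(y, design)              # default: prior.df is estimated

# Side-by-side BCV plots: the tagwise points in the right panel
# should sit closer to the trend than those in the left panel.
par(mfrow = c(1, 2))
plotBCV(y.raw, main = "prior.df = 0")
plotBCV(y.shrunk, main = "default prior.df")
```

With only three replicates per group, the raw estimates scatter widely around the trend, which is the estimation uncertainty that the shrinkage is meant to tame.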
Now, as for how this affects hypothesis testing - there are three main points, in order of obviousness:
- Larger dispersions = higher variance between replicates, which reduces power to detect DE.
- The performance of the model (and thus of the DE analysis) depends on the accuracy of the dispersion estimate. If there is a strong mean-dispersion trend, the common dispersion is obviously unsuitable. If the gene-specific dispersions vary around the trend, the trended dispersion is unsuitable. The "raw" tagwise estimates are unbiased estimates of the gene-specific dispersions, and should be the most suitable, except...
- The performance of the model also depends on the precision with which the dispersions are estimated. Here, the raw estimates are least stable as they use the least amount of information, whereas the trended (and to a greater extent, common) dispersion estimates share information between genes for greater stability. This is why the shrunken tagwise estimates (that you get with default estimateDisp) are so useful, as they provide a compromise between precision and accuracy.
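To put numbers on the first point, recall that under the negative binomial model the variance of a count with mean mu and dispersion phi is mu + phi * mu^2, and the BCV is sqrt(phi). A small base R sketch (the helper name nb_var and the example values are made up for illustration):

```r
# Negative binomial variance as a function of mean and dispersion.
# nb_var is a hypothetical helper, not an edgeR function.
nb_var <- function(mu, dispersion) mu + dispersion * mu^2

mu <- 100
dispersions <- c(0.01, 0.1, 0.4)

data.frame(
  dispersion = dispersions,
  BCV = sqrt(dispersions),               # biological coefficient of variation
  variance = nb_var(mu, dispersions)     # 200, 1100 and 4100 at mu = 100
)
```

At the same mean, a larger dispersion inflates the variance between replicates, so a given fold change is harder to distinguish from noise.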
You may already know that we are now recommending the QL framework with glmQLFTest for routine GLM-based DE analyses. This introduces another set of concepts, namely the distinction between negative binomial dispersions and quasi-likelihood dispersions. Long story short, the NB dispersions aim to model the mean-dispersion relationship across the entire dataset, while the QL dispersions aim to capture the variability and estimation uncertainty of the dispersion for each gene.
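For reference, a bare-bones QL analysis might look something like this. The counts are simulated for illustration only, and coef = 2 simply tests the group effect in this particular two-group design; adjust it for your own design matrix.

```r
library(edgeR)

# Simulated counts, for illustration only: 1000 genes, 2 groups x 3 replicates.
set.seed(3)
ngenes <- 1000
group <- factor(rep(c("A", "B"), each = 3))
mu <- 2^runif(ngenes, 3, 10)   # assumed gene-wise mean counts
counts <- matrix(rnbinom(ngenes * 6, mu = mu, size = 1 / 0.1), nrow = ngenes)

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
design <- model.matrix(~group)

y <- estimateDisp(y, design)       # NB dispersions (common, trended, tagwise)
fit <- glmQLFit(y, design)         # fits the GLMs and estimates QL dispersions
res <- glmQLFTest(fit, coef = 2)   # QL F-test for the group effect

topTags(res)      # top-ranked genes
plotQLDisp(fit)   # QL dispersions plotted against average abundance
```

plotQLDisp plays a similar role to plotBCV here, in that it shows the gene-wise QL dispersion estimates before and after shrinkage.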
Aaron Lun • 21k