hi Nik,
These are very good questions. For simplicity, I will be speaking of within-group estimates: the estimates of dispersion tend toward zero occur essentially whenever the variance is less than the mean, even with a positive true dispersion for the distribution. This happens often when the counts are low. For example:
varmean <- function(x) var(x)/mean(x)
sum(replicate(100, varmean(rnbinom(n=3,mu=10,size=10))) < 1)
So the low values are often not a violation of the modeling assumptions, but just occur due to random sampling.
Raising low dispersion values is a conservative choice, as is not lowering very high values. The real statistical challenge with RNA-seq differential expression is typically small replicate size (n=3-5), and so doing conservative things regarding shrinkage (which is strongest when sample sizes are small and which disappears as sample size grows) makes sense.
Regarding (2), a few points: yes, if the log dispersion values do follow a Normal then we are not shrinking for ~5%. But again we think this is the best choice. The MLE values are another valid estimator for dispersion, and here we are just making a conservative choice when sample size is very small and shrinkage would be strong.
Regarding differential expression (DE), note that dispersion and DE status are not linked. These genes with high dispersion are not the DE genes, they are the genes with extremely high within-group variance.
Of course, there could be DE genes in this set of genes with high dispersion (as there can be DE genes at all levels of dispersion), but these are the hardest genes to include in an FDR-bounded set, because the noise level is so high. The point of a statistical method is to maximize sensitivity while controlling the FDR, which inevitably means not calling such genes DE, so as to control the FDR. Here we are not guaranteeing that we won't call these genes DE, but we are not moderating dispersion and so the DE evidence needs to be strong as the gene-level (without the information sharing across genes).
Thank you for replying. Maybe you could consider some kind of a three component model for the dispersion parameter.
So, the DESeq2 methods are basically fixed. There are many benchmarking papers showing we have very good sensitivity while controlling specificity or precision, and we don't want our method to be a moving target. You can imagine trying to have more sophisticated approaches to all the estimation tasks in gene-level DE, but then you have to guarantee that these will work well not just on simulation but on a variety of shapes and sizes of real datasets, and won't do something strange or unexpected. I like the current DESeq2 methods because our performance is good according to independent groups doing benchmarking and because I know that many groups have tested them on all kind of data and they give reasonable answers (or else I would hear about the problems on the support site).
Hi Nik,
As a side note, there is a paper on mixture modelling with the dispersions and associated software:
Bonafede et. al. - Modeling overdispersion heterogeneity in differential expression analysis using mixtures - 2015
https://cran.r-project.org/web/packages/mixtNB/
http://arxiv.org/abs/1410.8093v2
www.dx.doi.org/10.1111/biom.12458
which might be of interest to you.
Bernd
Good point Bernd. Actually that happens to capture the discussion here, because they show with simulation that DESeq2 has a FPR rate a little less than the target alpha, so too conservative of inference.
That's queer. I came across this paper
http://www.lifesci.dundee.ac.uk/sites/www.lifesci.dundee.ac.uk/files/schurch%20et%20al%202016.pdf
where in Fig 4 one can see that DESeq2 does not control FPR. Since the shrinkage of variance parameters is conservative, it's probably due to the shrinkage of regression coefficients. The latter is absent in DESeq which does not have the FPR problem, according to the paper.
There was an error in the original evaluations in which DESeq2 filtered genes were assigned a p-value of 0 (note: by the authors scripts, not by DESeq2).
The authors have issued a correction, see the latest:
http://rnajournal.cshlp.org/content/early/2016/03/28/rna.053959.115
And the corresponding authors blog post:
https://geoffbarton.wordpress.com/2016/04/21/how-many-biological-replicates-are-needed-in-an-rna-seq-experiment-and-which-differential-expression-tool-should-you-use/
Hi Nik,
As Mike says, there was an error in the original early-access publication of our paper that lead to incorrect FPR results for DESeq2. Thanks for Mike, the bug was quickly spotted and corrected and the final publication version (which he has already linked to) correctly shows the DESeq2 performs well across all our benchmark tests.
Could I ask, where did you find the link to this old copy of the paper? It's certainly there (at the time of posting anyway, we're taking this old incorrect version down now), but I can't find any links to it from out homepage, arXiv, RNA, or a google search and we'd like correct the linking page to point at the final version of the paper.
Thank you for replying. I got the link because I am subscribed to Google Scholar alerts.