If you get a hill-shaped histogram, it simply means that the p-values are not "well calibrated".
Explaining this more or less properly takes some space, so sorry in advance for this
rather long post.
In a typical RNA-Seq experiment, most genes will not be (strongly) differentially expressed. The p-values for these genes should follow a uniform distribution, while the p-values for the differentially expressed genes will be enriched for low values.
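This behaviour is easy to see in a small simulation (a toy sketch, not DESeq2 itself): two-sided p-values computed from z-scores are uniform for null genes, while genes with a shifted z-score pile up at the low end. The effect size of 3 below is an arbitrary illustrative choice.

```python
# Two-sided p-values from z-scores against a standard normal null.
# Null genes give roughly uniform p-values; "differentially expressed"
# genes (shifted z-scores) are enriched for small p-values.
import math
import random

random.seed(1)

def pvalue(z):
    """Two-sided p-value of z against a N(0, 1) null."""
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

null_p = [pvalue(random.gauss(0.0, 1.0)) for _ in range(20000)]
de_p = [pvalue(random.gauss(3.0, 1.0)) for _ in range(2000)]  # effect size 3 is made up

# Null p-values: each decile of [0, 1] holds roughly 10% of the genes.
deciles = [sum(1 for p in null_p if i / 10 <= p < (i + 1) / 10) for i in range(10)]
print(deciles)

# DE p-values are strongly enriched at the low end.
print(sum(1 for p in de_p if p < 0.05) / len(de_p))
```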
Now a simple way to formalize this is to think of the Wald test z-scores from DESeq2 as following
a two-groups model, with a null and an alternative distribution. The "theoretical null" should
be a N(0,1) distribution, but this need not always be the case: the true null distribution can be narrower, so the p-values come out too high in the lower end of the range, leading to a hill-shaped p-value histogram.
See section 5 of
for a simulated example that should make this clear. See
for a recent paper discussing this phenomenon.
Now a simple solution to this problem (also followed in the SAGMB paper)
is to include a scale parameter sigma in the null distribution of the z-scores and
model it as N(0, sigma). If you have a hill-shaped p-value histogram, the theoretical
sigma of 1 is most probably too high, and fdrtool can be used to estimate the sigma
parameter from the data, leading to an estimate less than 1.
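A toy version of this empirical-null idea (fdrtool's actual estimator is far more careful) is to estimate sigma robustly from the bulk of the z-scores, e.g. via the median absolute deviation, and then recompute the p-values against N(0, sigma_hat). All the numbers below (null sd 0.8, 5% DE genes with mean 4) are made up for illustration.

```python
# Estimate the null sd robustly from the bulk of the z-scores, then
# recompute p-values against N(0, sigma_hat) instead of N(0, 1).
import math
import random

random.seed(3)

def pvalue(z, sigma=1.0):
    phi = 0.5 * (1.0 + math.erf(abs(z) / (sigma * math.sqrt(2.0))))
    return 2.0 * (1.0 - phi)

# Mostly null z-scores with sd 0.8, plus a small DE fraction (all made up).
z = [random.gauss(0.0, 0.8) for _ in range(19000)] + \
    [random.gauss(4.0, 1.0) for _ in range(1000)]

# Median-absolute-deviation scale estimate, robust to the DE genes;
# dividing by 0.6745 makes it consistent for a normal distribution.
med = sorted(z)[len(z) // 2]
mad = sorted(abs(v - med) for v in z)[len(z) // 2]
sigma_hat = mad / 0.6745
print(sigma_hat)  # well below the theoretical 1

p = [pvalue(v, sigma_hat) for v in z[:19000]]  # null genes only
frac = sum(1 for v in p if v < 0.05) / len(p)
print(frac)  # much closer to the nominal 0.05 than before
```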
See Efron's review article and section 6 of his book "Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction" for details on empirical null estimation.
Another important question is where the problem with the "incorrect" p-values
actually comes from.
My guess is that this often stems from the fact that DESeq2, just
like limma, models a single variance parameter per gene (the dispersion). This implicitly assumes that the within-group variability is the same across the groups. If this is not correct,
e.g. because the control group is less variable than the treatment group(s), it might lead to
wrong p-values. If the sample size is quite low, outlying samples might also cause problems.
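Here is a small illustration of that point (toy numbers, not DESeq2's actual model): with unequal group sizes and unequal group variances, a pooled two-sample t-test — i.e. one shared variance parameter — rejects far more than 5% of truly null genes. The group sizes, standard deviations, and gene count are all invented for the sketch.

```python
# No gene is truly differential here, yet the pooled-variance t-test
# rejects well above the nominal 5% because one group is much more
# variable than the other and the group sizes differ.
import math
import random

random.seed(4)

T_CRIT = 2.365          # two-sided 5% critical value of t with 7 df
n1, n2 = 6, 3           # unequal group sizes (illustrative)
sd1, sd2 = 1.0, 3.0     # second group much more variable (illustrative)

genes = 20000
rejections = 0
for _ in range(genes):
    x = [random.gauss(0.0, sd1) for _ in range(n1)]
    y = [random.gauss(0.0, sd2) for _ in range(n2)]
    m1, m2 = sum(x) / n1, sum(y) / n2
    s1 = sum((v - m1) ** 2 for v in x) / (n1 - 1)
    s2 = sum((v - m2) ** 2 for v in y) / (n2 - 1)
    # One pooled variance for both groups, as a single dispersion would assume.
    pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    if abs(t) > T_CRIT:
        rejections += 1

frac = rejections / genes
print(frac)  # clearly above the nominal 0.05
```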
Now one could model a separate variance parameter per group, but these are much harder
to estimate, so it is hard to do automatically in a software package. For an analysis that uses
multiple dispersion parameters per feature, see:
So it is easier to simply re-estimate the null model from the observed test statistics and then use
this empirical null model to compute the p-values.