Question

DESeq2 dispersion and p-value histograms

2

Entering edit mode

rrcutler ▴ 70

@rrcutler-10667

Last seen 4.2 years ago

After reading some post that recommend checking p-value distributions after DE testing, I went back and looked at some data where the result of DE testing was 0 significantly DE genes (3 replicates, 3 conditions, 3 times). Note that this was using an experiment where all samples were loaded into DESeq2 so that additional samples could be used for dispersion estimation. It turns out those p-value distributions were all hill shaped like so

This is not expected in a DE test the resulted in 0 genes, as the pvalues should be evenly distributed across all p-values in this type of result. Instead this seems to reflect an overestimated dispersion where many genes were assigned high pvalues because of their high dispersion values when their dispersion value should truly be lower or not gene-wise (group-wise).

Thus, I attempted to correct this using the recommended fdrtool and got a better looking p-value distribution (I think). When looking at DE, I strikingly also now had the result of over 1000 significant DE genes for the same comparison that just had 0.

To further see how dispersion might be affecting this, I decided to do the same test, but only loading into DESeq2 the samples which I wanted to compare (not the whole experiment). I had done this in the past when I was not correcting for the p-value distribution as above and got similar results to loading all samples into DESeq2.

The p-value distribution and correction looked similar to the histograms above, however, the new DE result contained less than 200 genes as compared to 1000 genes when loading all samples! When looking at a PCA plot, it would not appear that these samples need to be run separately.

My So my question is: Am I right by correcting p-values by examining p-value distributions? Why am I getting these huge differences when I do the correction, but am loading my samples into DESeq2 differently?

deseq2 rnaseq fdrtool dispersion • 6.7k views

ADD COMMENT • link updated 6.5 years ago by Bernd Klaus ▴ 610 • written 6.5 years ago by rrcutler ▴ 70

score 2 · Answer 1 · 2017-10-30

Hi Ronald,

as Mike said, there is always some "detective work" involved on determining why a p-value distribution doesn't look like it "should".

In my workflow, I validated the correction by looking at the overlap of the results of the original analysis with my corrected list, where I find 105 out of the 116 genes at an FDR of 0.1. So I can see that basically some power has been gained by reestimating the null model.

The main problem is that a hill shaped p-value histogram can also arise when there is a batch effect overlapping the experimental groups, se e.g. the following simulated example, where group 2 and sample 5:10 of group 1 are contained in a different batch):

library(tidyverse)
library(genefilter)

n <- 10000
m <- 20
x <- matrix(rnorm(n*m), nrow = n, ncol = m)
fac <- factor(c(rep(0, 10), rep(1, 10)))
rt1 <- rowttests(x, fac)

qplot(rt1$p.value, fill = I("tan3"))

x[, 6:15] <- x[, 6:15]+5
rt2 <- rowttests(x, fac)

qplot(rt2$p.value, fill = I("coral3"))

So I have the following suggestions (You can try any subset of them):

1.) Consider Pre-filtering your features as there are two peaks in the histogram near 0.8 that probably correspond to features with low counts that then always get a highly similar p-value

2.) Run DESeq2 on each time point separately and check the estimated dispersion prior variance via

attr(dispersionFunction(dds), "dispPriorVar")

Are these very different between the timepoints? If yes, this might be similar to the "different dispersion within the experimental groups " case described in the FAQ of the vignette.

However, note that this assumes that the actual dispersions trends are similar so you also compare two separate time points' dispersion values directly. They can be accessed via rowData(dds)$dispersion

And you could e.g. look at a scatterplot of log10(rowData(dds)$dispersion) in time 1 vs time 2

3.) If you let SVA estimate a surrogate variable and run DESeq2 adding it in your design, does it fix the problem?

(http://bioconductor.org/help/workflows/rnaseqGene/#removing-hidden-batch-effects)

4.) When running LRTs on hypothesis you are interested in (as suggested by Mike), what do the p-value histograms look like?

5.) Plot the raw data for some hits after using the p-value correction, do their time courses show sufficient change given the data variability? (the function plotCounts might be useful)

If the variance of the null distribution is too low there will be many significant genes with small effect sizes.

score 0 · Answer 2 · 2017-10-27

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 3 hours ago

United States

What are the colors in the PCA plot?

Remember the PCA is just a 2 dimensional representation of a much higher dimensional space. You can't necessarily tell from a PCA plot if some subset of the genes has lower dispersion for your additional samples. Adding samples with lower dispersion for a subset of genes would lower the dispersion estimate across all samples, and could easily move some genes from one side of the significance threshold to the other (in addition, remember that the adjusted p-values are interdependent, so being among lower p-values "helps" as far as the adjustment for a given p-value). So I wouldn't be surprised that the number of DE genes changes after adding a third batch of samples. It's hard for me to say in a general way whether users should add "additional" samples if they are focused on a pair of conditions. It's too experiment-specific for me to make a general recommendation, so I just have the FAQ in the vignette as rough guidance on when it should not be done.

For me, it's too difficult to diagnose over the support forum when adjustment of p-values based on the histogram is potentially going around other issues that should be dealt with more directly (was it an LFC threshold test, in which case we do not expect a uniform distribution for the null p-values? are there unmodeled batch variables that would be better dealt with using a method like sva or RUV?), such that I don't include it as part of the recommended pipeline. Bernd can offer more guidance here perhaps.

ADD COMMENT • link 6.5 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Thanks for the response. The colors in the PCA plot represent replicates of a condition. Each cluster is representative of a different time point. I have an equal interest in all the comparisons using all samples, so is loading all in fact appropriate here? Are there any test we can do to determine if it is or not?

From what I can gather, the reason I am getting very different results when either including or not including all samples for an experiment and then correcting the pvalues using the fdrtool is because dispersion values differ in each cluster (time point), so adding more samples from a different cluster(where global gene expression levels differ), is changing this estimate, which may not reflect the true dispersion. Given this, I am thinking it would be more appropriate to load only the sample from the same time point (although this does not allow to control for interactions interactions between condition and time).

The test used was the default Wald test, so I'm assuming that what is expected is a uniform distribution of p-values. There are no batch effects that I am aware of (I would imagine I could see that in the PCA), though I have made an attempt to detect any unknown effect.

As always, I appreciate your help.

-R

ADD REPLY • link 6.5 years ago rrcutler ▴ 70

1

Entering edit mode

re: "The test used was the default Wald test"

It sounds like you may be more interested in an LRT, given that you have different conditions and different time points. Maybe we can start from the beginning and I can help figure out an appropriate full and reduced design, and then you can see the distribution of p-values for the LRT, or Wald tests for coefficients within the design with interaction terms.

ADD REPLY • link 6.5 years ago Michael Love 41k