After reading some post that recommend checking p-value distributions after DE testing, I went back and looked at some data where the result of DE testing was 0 significantly DE genes (3 replicates, 3 conditions, 3 times). Note that this was using an experiment where all samples were loaded into DESeq2 so that additional samples could be used for dispersion estimation. It turns out those p-value distributions were all hill shaped like so
This is not expected in a DE test the resulted in 0 genes, as the pvalues should be evenly distributed across all p-values in this type of result. Instead this seems to reflect an overestimated dispersion where many genes were assigned high pvalues because of their high dispersion values when their dispersion value should truly be lower or not gene-wise (group-wise).
Thus, I attempted to correct this using the recommended fdrtool and got a better looking p-value distribution (I think). When looking at DE, I strikingly also now had the result of over 1000 significant DE genes for the same comparison that just had 0.
To further see how dispersion might be affecting this, I decided to do the same test, but only loading into DESeq2 the samples which I wanted to compare (not the whole experiment). I had done this in the past when I was not correcting for the p-value distribution as above and got similar results to loading all samples into DESeq2.
The p-value distribution and correction looked similar to the histograms above, however, the new DE result contained less than 200 genes as compared to 1000 genes when loading all samples! When looking at a PCA plot, it would not appear that these samples need to be run separately.
My So my question is: Am I right by correcting p-values by examining p-value distributions? Why am I getting these huge differences when I do the correction, but am loading my samples into DESeq2 differently?