Hello,
I originally emailed this question to Bernd Klaus at EMBL and got a really good and well-written response. I promised to post the question and answer here since more people might find it interesting, especially the approach one might take to explore the data when p-value distributions look "hill-shaped".
Ok here it goes:
I was doing DE analysis with DESeq2 and performed a lot of different comparisons between conditions. However, for some of the comparisons I got "hill-shaped" p-value distributions. Searching for an answer, I found this tutorial on the EMBL website: http://www-huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html, which applies a p-value correction with the R package "fdrtool". One possible reason for obtaining "hill-shaped" p-value distributions is unequal variance within the treatment groups being compared; because DESeq2 estimates one dispersion per gene across both treatments, the result can be an underestimation of significance.
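For reference, the correction in that tutorial works by re-estimating the null distribution from the DESeq2 Wald statistics instead of relying on the theoretical null. A minimal sketch of that idea, assuming a results object res obtained from results(dds) (the exact filtering steps in the tutorial may differ slightly):

library(DESeq2)
library(fdrtool)

## drop genes that were not tested (NA p-values / adjusted p-values)
res <- res[!is.na(res$pvalue) & !is.na(res$padj), ]

## re-estimate the empirical null from the Wald z-scores and
## obtain corrected p-values
fdr_res <- fdrtool(res$stat, statistic = "normal", plot = TRUE)

## replace the adjusted p-values with BH-adjusted corrected p-values
res$padj <- p.adjust(fdr_res$pval, method = "BH")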
My question was how to confirm that this really is the case, i.e. do I have large within-group variances that cause this problem? Also, after using fdrtool I got "healthy"-looking p-value distributions in all cases except one - why is that?
Bernd answered pedagogically with an R script (attached below) and the following text:
I.) See section 1 of the attached R script, where I try to identify
different dispersions in two experimental groups. Briefly, the
strategy is to take 1/baseMean + dispersion as a proxy for the
variance and then take the 4th root of this quantity as this makes
it more normally distributed.
I then run a paired t-test to check whether there are global
differences in variance between the conditions.
Note, however, that this is not very reliable if you have only a few
replicates in each experimental group; I would recommend at least
5 samples per group to avoid false positives.
II.) Histogram shapes like the one you observe can also be caused by
a batch effect that overlaps the experimental groups. See
section 2 of the attached script.
III.) Peaks in the p-value histogram often correspond to genes that
have a low number of counts overall. You can filter these out beforehand
(see the short filtering sketch after the script below).
The R script that Bernd provided is shown below:
---------------------------------------------------------------------
# 1.) Comparing within-group variability
# caveat: for low sample sizes, variability in dispersion estimates is substantial

set.seed(12345)
library(DESeq2)

# object construction, 2 conditions, we want to check for differential
# variability between the two conditions
dds <- makeExampleDESeqDataSet(n = 1000, m = 20)

# split it into two data sets, one per condition
dds_A <- dds[, colData(dds)$condition == "A"]
design(dds_A) <- formula(~ 1)

dds_B <- dds[, colData(dds)$condition == "B"]
design(dds_B) <- formula(~ 1)

# standard analysis
dds_A <- DESeq(dds_A)
res_A <- results(dds_A)

dds_B <- DESeq(dds_B)
res_B <- results(dds_B)

# Variance on the natural log scale is approx ~ 1/baseMean + dispersion,
# so on the log2 scale (the scale used for rlog etc. in DESeq2) it
# is ( 1/log2(exp(1)) )^2 * ( 1/baseMean + dispersion )
var_A <- ( 1/log2(exp(1)) )^2 * (1 / rowData(dds_A)$baseMean + rowData(dds_A)$dispersion)
var_B <- ( 1/log2(exp(1)) )^2 * (1 / rowData(dds_B)$baseMean + rowData(dds_B)$dispersion)

# Now take the 4th root of the variances,
# just like in the limma-voom paper:
# https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29
score_A <- var_A^0.25
score_B <- var_B^0.25

library(tidyverse)

# Differences in scores center around zero
qplot(score_A - score_B, geom = "dotplot", binwidth = 0.008)
qplot(score_A - score_B, geom = "density")

# overall significance:
t.test(score_A, score_B, paired = TRUE)
# no significance!

# 2.) Batch effects overlapping experimental groups can lead to histograms like
# the ones you observe in your data
library(genefilter)

n <- 10000
m <- 20
x <- matrix(rnorm(n * m), nrow = n, ncol = m)
fac <- factor(c(rep(0, 10), rep(1, 10)))

# clean data: uniform p-value histogram
rt1 <- rowttests(x, fac)
qplot(rt1$p.value, fill = I("tan3"))

# shift samples 6-15 (overlapping both groups) to mimic a batch effect
x[, 6:15] <- x[, 6:15] + 1
rt2 <- rowttests(x, fac)
qplot(rt2$p.value, fill = I("coral3"))
---------------------------------------------------------------------------------------
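The script above covers points I and II. For point III (low counts), a minimal pre-filtering sketch, assuming a DESeqDataSet called dds and an arbitrary threshold of 10 total reads:

library(DESeq2)

## keep only genes with at least 10 reads summed over all samples
## (the threshold is only an illustration; tune it to your data)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

Note that results() also applies independent filtering on the mean of normalized counts by default, which addresses the same issue after the testing step.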
As it happens, for my particular data this answered part of the questions but not all. After some more conversation, Bernd also showed an approach for looking at potential batch effects across your conditions/treatment groups that are hidden by the dominant patient-to-patient variability. (This works conceptually like removeBatchEffect in limma.)
You can use an approach like this to remove e.g. the patient effect beforehand. Note that you should not use the cleaned data for the final DE analysis, but it can help you spot culprits in your data. (A lighter-weight alternative using limma::removeBatchEffect is sketched after the script.)
----------------------------------------------------------------------------------------
set.seed(777)

library(DESeq2)
library(ggplot2)
library(matrixStats)

### create example data set with differential expression between the two groups
dds <- makeExampleDESeqDataSet(m = 8, betaSD = 2)

### estimate size factors etc.
dds <- DESeq(dds)
sf_s <- sizeFactors(dds)

## add sex as a batch effect, there are two males and two females in each group
colData(dds)$sex <- factor(rep(c("m", "f"), 4))

## modify counts to add a batch effect, we add normally distributed random noise
## with mean 4 to randomly selected genes of male samples and then round the result
cts <- counts(dds, normalized = TRUE)
ranGenes <- floor(runif(300) * 1000)

for (i in ranGenes) {
  cts[i, colData(dds)$sex == "m"] <- cts[i, colData(dds)$sex == "m"] + round(rnorm(1, 4, sd = 1))
}

counts(dds) <- apply(round(t(t(cts) * sf_s)), 2, as.integer)

## produce PCA plot to see the batch effects
rld <- rlogTransformation(dds, blind = TRUE)

ntop <- 500
Pvars <- rowVars(assay(rld))
select <- order(Pvars, decreasing = TRUE)[seq_len(min(ntop, length(Pvars)))]

PCA <- prcomp(t(assay(rld)[select, ]), scale = FALSE)
percentVar <- round(100 * PCA$sdev^2 / sum(PCA$sdev^2), 1)

dataGG <- data.frame(PC1 = PCA$x[, 1], PC2 = PCA$x[, 2],
                     PC3 = PCA$x[, 3], PC4 = PCA$x[, 4],
                     sex = colData(rld)$sex,
                     condition = colData(rld)$condition)

(qplot(PC1, PC2, data = dataGG, color = condition, shape = sex,
       main = "PC1 vs PC2, top variable genes", size = I(6)) +
   labs(x = paste0("PC1, VarExp:", round(percentVar[1], 4)),
        y = paste0("PC2, VarExp:", round(percentVar[2], 4))) +
   scale_colour_brewer(type = "qual", palette = 2)
)

## the first PC shows a clear group separation, while the second
## PC shows a clear clustering by sex

## We now run the DESeq computations and
## extract the fitted sex coefficient to remove the batch effect
design(dds) <- ~ sex + condition
dds <- DESeq(dds)

## display the names of the fitted coefficients
resultsNames(dds)

## Note that DESeq2 will fit a coefficient for each group, which
## creates non-full-rank design matrices. However, due to the shrinkage
## estimation (beta shrinkage), this is not a problem.

## extract means for females on the original scale, they scatter around 1
## make sure to turn off outlier detection and independent filtering
## to avoid NAs
mean.f <- results(dds, contrast = c(0, 1, 0, 0, 0),
                  independentFiltering = FALSE, cooksCutoff = FALSE)$log2FoldChange
hist(2^mean.f, col = "darkgreen", main = "Female effects")

## extract means for males; on the original scale we see a strong right-hand tail
## that corresponds to the genes affected by the sex effect
mean.m <- results(dds, contrast = c(0, 0, 1, 0, 0),
                  independentFiltering = FALSE, cooksCutoff = FALSE)$log2FoldChange
hist(2^mean.m, col = "lavender", main = "Male effects")

## remaining NAs correspond to genes with zero counts in all samples
## We thus set the NAs to zero
mean.m[is.na(mean.m)] <- 0
mean.f[is.na(mean.f)] <- 0

## now remove the batch effects
## use normalized counts for the removal, as the coefficients are estimated on
## the normalized counts
cts_corrected <- counts(dds, normalized = TRUE)

for (i in 1:1000) {
  cts_corrected[i, colData(dds)$sex == "f"] <- as.integer(round(counts(dds, normalized = TRUE)[i, colData(dds)$sex == "f"] / 2^mean.f[i]))
  cts_corrected[i, colData(dds)$sex == "m"] <- as.integer(round(counts(dds, normalized = TRUE)[i, colData(dds)$sex == "m"] / 2^mean.m[i]))
}

## replace the original with the corrected counts, undo the normalization
counts(dds) <- apply(round(t(t(cts_corrected) * sf_s)), 2, as.integer)

## now produce the PCA plot again to check the correction
rld <- rlogTransformation(dds, blind = TRUE)

ntop <- 500
Pvars <- rowVars(assay(rld))
select <- order(Pvars, decreasing = TRUE)[seq_len(min(ntop, length(Pvars)))]

PCA <- prcomp(t(assay(rld)[select, ]), scale = FALSE)
percentVar <- round(100 * PCA$sdev^2 / sum(PCA$sdev^2), 1)

dataGG <- data.frame(PC1 = PCA$x[, 1], PC2 = PCA$x[, 2],
                     PC3 = PCA$x[, 3], PC4 = PCA$x[, 4],
                     sex = colData(rld)$sex,
                     condition = colData(rld)$condition)

(qplot(PC1, PC2, data = dataGG, color = condition, shape = sex,
       main = "PC1 vs PC2, top variable genes, corrected data", size = I(6)) +
   labs(x = paste0("PC1, VarExp:", round(percentVar[1], 4)),
        y = paste0("PC2, VarExp:", round(percentVar[2], 4))) +
   scale_colour_brewer(type = "qual", palette = 2)
)

## The correction was successful, the second PC no longer separates
## samples according to sex
---------------------------------------------------------------------
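For exploration only, a lighter-weight alternative to the count-level correction above is to remove a known batch factor from rlog-transformed data with limma::removeBatchEffect before plotting the PCA. A sketch (my addition, not part of Bernd's reply), assuming a DESeqDataSet dds whose colData contains a sex factor like the one constructed before the correction in the script above:

library(DESeq2)
library(limma)

## transform, then subtract the fitted sex effect while protecting
## the condition effect via the design argument
rld <- rlog(dds, blind = FALSE)
mm  <- model.matrix(~ condition, data = as.data.frame(colData(rld)))
assay(rld) <- removeBatchEffect(assay(rld), batch = rld$sex, design = mm)

## the batch factor should no longer drive a principal component
plotPCA(rld, intgroup = c("condition", "sex"))

As with the corrected counts above, this cleaned data is for visualization and outlier hunting only; the actual DE test should instead keep the batch factor in the DESeq2 design.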
Maybe I should "mark" this question as completed? (I am not sure of the appropriate approach when posting something like this.)