Hello,
I originally emailed this question to Bernd Klaus at EMBL and got a really good and well-written response. I promised to post the question and answer here since more people might find it interesting, especially the approach one might take to explore the data when p-value distributions look "hill-shaped".
Ok here it goes:
I was doing DE analysis with DESeq2 and performed a lot of different comparisons between conditions. However, for some of the comparisons I got "hill-shaped" p-value distributions. Searching for an answer, I found this tutorial on the EMBL website: http://www-huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html, which applies a p-value correction with the R package "fdrtool". One possible reason for obtaining "hill-shaped" p-value distributions is unequal variance within the treatment groups being compared; because DESeq2 estimates one dispersion per gene across both treatments, the result can be an underestimation of significance.
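For reference, the correction in that tutorial works by re-estimating the null distribution from the DESeq2 Wald statistics instead of relying on the theoretical null. A minimal sketch of that idea, assuming a results object res obtained from results(dds) (the exact filtering steps in the tutorial may differ slightly):

library(DESeq2)
library(fdrtool)

## drop genes that were not tested (NA p-values / adjusted p-values)
res <- res[!is.na(res$pvalue) & !is.na(res$padj), ]

## re-estimate the empirical null from the Wald z-scores and
## obtain corrected p-values
fdr_res <- fdrtool(res$stat, statistic = "normal", plot = TRUE)

## replace the adjusted p-values with BH-adjusted corrected p-values
res$padj <- p.adjust(fdr_res$pval, method = "BH")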
My question was how to confirm that this really is the case, i.e. do I have large within-group variances that cause this problem? Also, after using fdrtool I got "healthy"-looking p-value distributions in all cases except one - why is that?
Bernd answered pedagogically with an R script (attached below) and the following text:
I.) See section 1 of the attached R script, where I try to identify
different dispersions in two experimental groups. Briefly, the
strategy is to take 1/baseMean + dispersion as a proxy for the
variance and then take the 4th root of this quantity as this makes
it more normally distributed.
I then run a paired t-test to check whether there are global
differences in variance between the conditions.
Note, however, that this is not very reliable if you have only a few
replicates in each experimental group; I would recommend at least
5 samples per group to avoid false positives.
II.) Histogram shapes like the one you observe can also be caused by
a batch effect that overlaps the experimental groups. See
section 2 of the attached script.
III.) Peaks in the p-value histogram often correspond to genes that
have a low number of counts overall. You can filter these out beforehand
(see the short filtering sketch after the script below).
The R script that Bernd provided is shown below:
---------------------------------------------------------------------
# 1.) Comparing within-group variability
# caveat: for low sample sizes, variability in dispersion estimates is substantial

set.seed(12345)
library(DESeq2)

# object construction, 2 conditions, we want to check for differential
# variability between the two conditions
dds <- makeExampleDESeqDataSet(n = 1000, m = 20)

# split it into two data sets, one per condition
dds_A <- dds[, colData(dds)$condition == "A"]
design(dds_A) <- formula(~ 1)

dds_B <- dds[, colData(dds)$condition == "B"]
design(dds_B) <- formula(~ 1)

# standard analysis
dds_A <- DESeq(dds_A)
res_A <- results(dds_A)

dds_B <- DESeq(dds_B)
res_B <- results(dds_B)

# Variance on the natural log scale is approx ~ 1/baseMean + dispersion,
# so on the log2 scale (the scale used for rlog etc. in DESeq2) it
# is ( 1/log2(exp(1)) )^2 * ( 1/baseMean + dispersion )
var_A <- ( 1/log2(exp(1)) )^2 * (1 / rowData(dds_A)$baseMean + rowData(dds_A)$dispersion)
var_B <- ( 1/log2(exp(1)) )^2 * (1 / rowData(dds_B)$baseMean + rowData(dds_B)$dispersion)

# Now take the 4th root of the variances,
# just like in the limma-voom paper:
# https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29
score_A <- var_A^0.25
score_B <- var_B^0.25

library(tidyverse)

# Differences in scores center around zero
qplot(score_A - score_B, geom = "dotplot", binwidth = 0.008)
qplot(score_A - score_B, geom = "density")

# overall significance:
t.test(score_A, score_B, paired = TRUE)
# no significance!

# 2.) Batch effects overlapping experimental groups can lead to histograms like
# the ones you observe in your data
library(genefilter)

n <- 10000
m <- 20
x <- matrix(rnorm(n * m), nrow = n, ncol = m)
fac <- factor(c(rep(0, 10), rep(1, 10)))

# clean data: uniform p-value histogram
rt1 <- rowttests(x, fac)
qplot(rt1$p.value, fill = I("tan3"))

# shift samples 6-15 (overlapping both groups) to mimic a batch effect
x[, 6:15] <- x[, 6:15] + 1
rt2 <- rowttests(x, fac)
qplot(rt2$p.value, fill = I("coral3"))
---------------------------------------------------------------------------------------
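The script above covers points I and II. For point III (low counts), a minimal pre-filtering sketch, assuming a DESeqDataSet called dds and an arbitrary threshold of 10 total reads:

library(DESeq2)

## keep only genes with at least 10 reads summed over all samples
## (the threshold is only an illustration; tune it to your data)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

Note that results() also applies independent filtering on the mean of normalized counts by default, which addresses the same issue after the testing step.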
As it happens, for my particular data this answered part of the questions but not all. After some more conversation, Bernd also showed an approach for looking at potential batch effects across your conditions/treatment groups that are hidden by the dominant patient-to-patient variability. (This works conceptually like removeBatchEffect in limma.)
You can use an approach like this to remove e.g. the patient effect beforehand. Note that you should not use the cleaned data for the final DE analysis, but it can help you spot culprits in your data. (A lighter-weight alternative using limma::removeBatchEffect is sketched after the script.)
----------------------------------------------------------------------------------------
set.seed(777)

library(DESeq2)
library(ggplot2)
library(matrixStats)

### create example data set with differential expression between the two groups
dds <- makeExampleDESeqDataSet(m = 8, betaSD = 2)

### estimate size factors etc.
dds <- DESeq(dds)
sf_s <- sizeFactors(dds)

## add sex as a batch effect, there are two males and two females in each group
colData(dds)$sex <- factor(rep(c("m", "f"), 4))

## modify counts to add a batch effect, we add normally distributed random noise
## with mean 4 to randomly selected genes of male samples and then round the result
cts <- counts(dds, normalized = TRUE)
ranGenes <- floor(runif(300) * 1000)

for (i in ranGenes) {
  cts[i, colData(dds)$sex == "m"] <- cts[i, colData(dds)$sex == "m"] + round(rnorm(1, 4, sd = 1))
}

counts(dds) <- apply(round(t(t(cts) * sf_s)), 2, as.integer)

## produce PCA plot to see the batch effects
rld <- rlogTransformation(dds, blind = TRUE)

ntop <- 500
Pvars <- rowVars(assay(rld))
select <- order(Pvars, decreasing = TRUE)[seq_len(min(ntop, length(Pvars)))]

PCA <- prcomp(t(assay(rld)[select, ]), scale = FALSE)
percentVar <- round(100 * PCA$sdev^2 / sum(PCA$sdev^2), 1)

dataGG <- data.frame(PC1 = PCA$x[, 1], PC2 = PCA$x[, 2],
                     PC3 = PCA$x[, 3], PC4 = PCA$x[, 4],
                     sex = colData(rld)$sex,
                     condition = colData(rld)$condition)

(qplot(PC1, PC2, data = dataGG, color = condition, shape = sex,
       main = "PC1 vs PC2, top variable genes", size = I(6)) +
   labs(x = paste0("PC1, VarExp:", round(percentVar[1], 4)),
        y = paste0("PC2, VarExp:", round(percentVar[2], 4))) +
   scale_colour_brewer(type = "qual", palette = 2)
)

## the first PC shows a clear group separation, while the second
## PC shows a clear clustering by sex

## We now run the DESeq computations and
## extract the fitted sex coefficient to remove the batch effect
design(dds) <- ~ sex + condition
dds <- DESeq(dds)

## display the names of the fitted coefficients
resultsNames(dds)

## Note that DESeq2 will fit a coefficient for each group, which
## creates non-full-rank design matrices. However, due to the shrinkage
## estimation (beta shrinkage), this is not a problem.

## extract means for females on the original scale, they scatter around 1
## make sure to turn off outlier detection and independent filtering
## to avoid NAs
mean.f <- results(dds, contrast = c(0, 1, 0, 0, 0),
                  independentFiltering = FALSE, cooksCutoff = FALSE)$log2FoldChange
hist(2^mean.f, col = "darkgreen", main = "Female effects")

## extract means for males; on the original scale we see a strong right-hand tail
## that corresponds to the genes affected by the sex effect
mean.m <- results(dds, contrast = c(0, 0, 1, 0, 0),
                  independentFiltering = FALSE, cooksCutoff = FALSE)$log2FoldChange
hist(2^mean.m, col = "lavender", main = "Male effects")

## remaining NAs correspond to genes with zero counts in all samples
## We thus set the NAs to zero
mean.m[is.na(mean.m)] <- 0
mean.f[is.na(mean.f)] <- 0

## now remove the batch effects
## use normalized counts for the removal, as the coefficients are estimated on
## the normalized counts
cts_corrected <- counts(dds, normalized = TRUE)

for (i in 1:1000) {
  cts_corrected[i, colData(dds)$sex == "f"] <- as.integer(round(counts(dds, normalized = TRUE)[i, colData(dds)$sex == "f"] / 2^mean.f[i]))
  cts_corrected[i, colData(dds)$sex == "m"] <- as.integer(round(counts(dds, normalized = TRUE)[i, colData(dds)$sex == "m"] / 2^mean.m[i]))
}

## replace the original with the corrected counts, undo the normalization
counts(dds) <- apply(round(t(t(cts_corrected) * sf_s)), 2, as.integer)

## now produce the PCA plot again to check the correction
rld <- rlogTransformation(dds, blind = TRUE)

ntop <- 500
Pvars <- rowVars(assay(rld))
select <- order(Pvars, decreasing = TRUE)[seq_len(min(ntop, length(Pvars)))]

PCA <- prcomp(t(assay(rld)[select, ]), scale = FALSE)
percentVar <- round(100 * PCA$sdev^2 / sum(PCA$sdev^2), 1)

dataGG <- data.frame(PC1 = PCA$x[, 1], PC2 = PCA$x[, 2],
                     PC3 = PCA$x[, 3], PC4 = PCA$x[, 4],
                     sex = colData(rld)$sex,
                     condition = colData(rld)$condition)

(qplot(PC1, PC2, data = dataGG, color = condition, shape = sex,
       main = "PC1 vs PC2, top variable genes, corrected data", size = I(6)) +
   labs(x = paste0("PC1, VarExp:", round(percentVar[1], 4)),
        y = paste0("PC2, VarExp:", round(percentVar[2], 4))) +
   scale_colour_brewer(type = "qual", palette = 2)
)

## The correction was successful, the second PC no longer separates
## samples according to sex
---------------------------------------------------------------------
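For exploration only, a lighter-weight alternative to the count-level correction above is to remove a known batch factor from rlog-transformed data with limma::removeBatchEffect before plotting the PCA. A sketch (my addition, not part of Bernd's reply), assuming a DESeqDataSet dds whose colData contains a sex factor like the one constructed before the correction in the script above:

library(DESeq2)
library(limma)

## transform, then subtract the fitted sex effect while protecting
## the condition effect via the design argument
rld <- rlog(dds, blind = FALSE)
mm  <- model.matrix(~ condition, data = as.data.frame(colData(rld)))
assay(rld) <- removeBatchEffect(assay(rld), batch = rld$sex, design = mm)

## the batch factor should no longer drive a principal component
plotPCA(rld, intgroup = c("condition", "sex"))

As with the corrected counts above, this cleaned data is for visualization and outlier hunting only; the actual DE test should instead keep the batch factor in the DESeq2 design.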
Maybe I should "mark" this question as completed? (I am not sure of the appropriate approach when posting something like this.)