"U" shaped p values distribution in DESeq2 RNA-seq DE analysis using the contrast argument
Entering edit mode
Last seen 7.1 years ago

Dear Community,

I'm using DESeq2 to analyze an RNAseq experiement in which I'm evaluating the transcriptional variations mediated by the over expression of some transcription factors.

The experimental design is as follows: 5 conditions (ctrl; x; y; z; x+y+z), and each condition ha 5 replicates.

In a previous question (see the link: https://support.bioconductor.org/p/80178/) I've asked whether the use of the "contrast" argument was appropriate to examine the group-wise differences in expression, and apparently it is correct tool.

However, inspecting the p value histograms (same thing for the pAdj ones) all the comparisons give rise to "U" shaped distributions, with a strong bias towards the "1" limit of the distribution (from 1 to 10 times higher than of the "0" peak).

Here you have the code I use:

condition<-c("x","x","x","x","x","y","y","y","y","y","z","z","z","z","z","xyz","xyz","xyz","xyz","xyz","ctrl","ctrl","ctrl","ctrl","ctrl") #declare my conditions
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory,design= ~ condition) #building the dds object
ddsHTSeqfilt <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ] #filtering 0 values
ddsHTSeqfilt$condition <- relevel(ddsHTSeqfilt$condition, ref="ctrl") # specifying to use ctrl as reference
dds <- DESeq(ddsHTSeqfilt)
DEx<-results(dds, contrast=c("condition","x","ctrl"), alpha=0.05, lfcThreshold = 1, altHypothesis = "greaterAbs", independentFiltering = T)
DEy<-results(dds, contrast=c("condition","y","ctrl"), alpha=0.05, lfcThreshold = 1, altHypothesis = "greaterAbs", independentFiltering = T)

DEz<-results(dds, contrast=c("condition","z","ctrl"), alpha=0.05, lfcThreshold = 1, altHypothesis = "greaterAbs", independentFiltering = T)

DExyz<-results(dds, contrast=c("condition","xyz","ctrl"), alpha=0.05, lfcThreshold = 1, altHypothesis = "greaterAbs", independentFiltering = T)


R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils     datasets  methods  
[10] base     

other attached packages:
 [1] fdrtool_1.2.15             BiocInstaller_1.20.3       org.Mm.eg.db_3.2.3        
 [4] RSQLite_1.0.0              DBI_0.3.1                  AnnotationDbi_1.32.3      
 [7] genefilter_1.52.1          PoiClaClu_1.0.2            RColorBrewer_1.1-2        
[10] reshape2_1.4.1             DESeq2_1.10.1              RcppArmadillo_0.6.700.3.0 
[13] Rcpp_0.12.4                SummarizedExperiment_1.0.2 Biobase_2.30.0            
[16] GenomicRanges_1.22.4       GenomeInfoDb_1.6.3         IRanges_2.4.8             
[19] S4Vectors_0.8.11           BiocGenerics_0.16.1        ggplot2_2.1.0             
[22] pheatmap_1.0.8             rmarkdown_0.9.5            circlize_0.3.5            
[25] ComplexHeatmap_1.6.0       MASS_7.3-45                ReporteRs_0.8.6           
[28] ReporteRsjars_0.0.2        BiocParallel_1.4.3         FactoMineR_1.32           
[31] rJava_0.9-8               

As far as I understood, reading the nice tutorial from Bernd Klaus and Wolfgang Huber (link: http://www-huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html#inspection-and-correction-of-pvalues), having a "U" skewed distribution of p values represents an underestimation of the gene-wise variance, and theoretically it should be possible to correct it using the "fdrtool" package. However this approach didn't work so far. Here you can see the code:

DEx <- DEx[ !is.na(DEx$padj), ]
DEx <- DEx[ !is.na(DEx$pvalue), ]
DEx <- DEx[, -which(names(DEx) == "padj")]
FDR.DEx <- fdrtool(DEx$stat, statistic= "normal", plot = T)

which ends up with the following error message:

Step 1... determine cutoff point
Step 2... estimate parameters of null distribution and eta0
Error in optimize(nlogL, lower = lo, upper = up) : 
  'xmin' not less than 'xmax'

I've inspected the data using PCA and several heat maps (for different genes sub-groups) and the samples have some obvious differences. However, I'm afraid to that having an abnormal "U" shaped p values distribution might lead to some misleading conclusion.

I'd really appreciate the help of the community here.

Thanks in advance for any help.

All the best.


deseq2 rnaseq contrast multiple comparisons fdrtool • 2.2k views
Entering edit mode
Last seen 2 hours ago
United States

The p-value distribution is not expected to be uniformly distributed when using the argument lfcThreshold, because we replace the composite null hypothesis with two simple null hypotheses. This procedure controls Type I error when the true LFC is at the border of the region defined by lfcThrehsold, and is conservative when the true LFC is within the region abs(LFC) < lfcThreshold. In other words, the spike at p value = 0 is expected given the way we have constructed the test, these are genes where the estimated LFC is falling in the center of the region, and it doesn't have to do with mis-estimation of the dispersion.

See the DESeq2 paper Methods section for "Composite null hypotheses"


Entering edit mode
Last seen 7.1 years ago

Thank you for the clarification Michael!


Login before adding your answer.

Traffic: 837 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6