Hi I am comparing RNA-seq between two conditions (WT vs. KO) using DESeq2. I have two different distribution of p-values depending on the cut-off for genes based on counting and application of fdrtool. I would like to listen to your advice for which is a correct way to go.
I used the following code to generate figures:
> dds<-dds[rowSums(counts(dds))>=count.th,] > dds<-DESeq(dds) > res<-results(dds,contrast = c("Group","KO","WT")) > hist(res$pvalue) # for figure A and C > res.emp <- res > res.emp <- res.emp[ !is.na(res.emp$pvalue), ] > emp.pval <- fdrtool(res.emp$stat, statistic= "normal", plot = T) > hist(emp.pval$pval) # figure B and D
In the attached figure (link), A and B were generated by count.th =1 with A before fdrtool, B after fdrtool. C and D were generated by count.th=10, with C before fdrtool, D after fdrtool.
A and B produced some number (~10) of significant genes (padj<0.01) while C and D produced no significant genes (padj<0.01).
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.6
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils
[7] datasets methods base
other attached packages:
[1] fdrtool_1.2.15 DESeq2_1.14.1
[3] SummarizedExperiment_1.4.0 Biobase_2.34.0
[5] GenomicRanges_1.26.4 GenomeInfoDb_1.10.3
[7] IRanges_2.8.2 S4Vectors_0.12.2
[9] BiocGenerics_0.20.0 reshape2_1.4.2
[11] pheatmap_1.0.8 ggplot2_2.2.1
loaded via a namespace (and not attached):
[1] genefilter_1.56.0 locfit_1.5-9.1 splines_3.3.3
[4] lattice_0.20-34 colorspace_1.3-2 htmltools_0.3.6
[7] base64enc_0.1-3 blob_1.1.0 survival_2.40-1
[10] XML_3.98-1.9 rlang_0.1.2 foreign_0.8-67
[13] DBI_0.7 BiocParallel_1.8.2 bit64_0.9-7
[16] RColorBrewer_1.1-2 plyr_1.8.4 stringr_1.2.0
[19] zlibbioc_1.20.0 munsell_0.4.3 gtable_0.2.0
[22] htmlwidgets_0.9 memoise_1.1.0 latticeExtra_0.6-28
[25] knitr_1.17 geneplotter_1.52.0 AnnotationDbi_1.36.2
[28] htmlTable_1.9 Rcpp_0.12.13 acepack_1.4.1
[31] xtable_1.8-2 scales_0.5.0 backports_1.1.1
[34] checkmate_1.8.5 Hmisc_4.0-3 annotate_1.52.1
[37] XVector_0.14.1 bit_1.1-12 gridExtra_2.3
[40] digest_0.6.12 stringi_1.1.5 grid_3.3.3
[43] tools_3.3.3 bitops_1.0-6 magrittr_1.5
[46] lazyeval_0.2.1 RCurl_1.95-4.8 tibble_1.3.4
[49] RSQLite_2.0 Formula_1.2-2 cluster_2.0.5
[52] Matrix_1.2-8 data.table_1.10.4-3 rpart_4.1-10
[55] nnet_7.3-12
Hi Michael,
Thanks for your answer. You are right. In PCA plot, the samples are intermingled regardless of conditions. I understand that there are few significant genes in this case. However, while I expect that the distribution of p-value is almost flat in the case, I currently have some depletion around 0 in the plot of p-value distribution as you see figure (c) and (d). Could you explain me how to understand this?
It could be from unaccounted-for heterogeneity, and there are packages that help you estimate surrogate variables or factors of unwanted variation.
In general, I don't like post-processing the p-values to induce some differential results unless there's a clear case for it, and that p-value histogram isn't a good case for it.
Thanks for your help. I agree with you. I don't want to do post-processing the p-values to have significance. I can conclude that they do not have significant genes. But it is still unclear that why the distribution of p-value is not flat in the case of no significant gene. If I understand you correctly, if we have unaccounted heterogeneity, it is possible for the distribution of p-values to have depletion near zero. Could you confirm my understanding?
Yes this can induce non uniform null.