Hi, I used DESeq2 for several differential expression analyses for paired patient samples (sample sizes in the range of 6 to 34 patients). The unadjusted p-values (with applying independentFiltering = TRUE) of the DESeq2 analyses (both Wald test and time series experiments using LRT) always showed a U-shaped distribution. I could not figure out what is wrong with my data or my applied tests. Now, I filtered out genes where there are less than 10% of samples with raw counts greater than or equal to 5 prior to DESeq2 analysis. The distribution of p-values is not U-shaped anymore. Without filtering for low-count genes I had extremely many genes yielding p-values close to 1. Now this extreme is resolved. My question: (1) There was one comment from Micheal Love (https://support.bioconductor.org/p/65256/) saying that there is no need for pre-filtering if one uses independentFiltering = TRUE, but my analysis included independentFiltering = TRUE. Why does pre-filtering remove most of the 'close-to-1' p-values and change the shape of the p-value distribution in my analysis? (2) Pre-filtering for low-count genes seems to have two advantages for me: resolving the alarming U-shaped p-value distribution and slightly more significant genes, because multiple testing correction for fewer genes. Are there disadvantages of pre-filtering? Could this produce bias? What is troubling more: pre-filtering or U-shaped p-value distribution? (3) Would you recommend to apply the pre-filtering on the normalized counts instead of raw counts?
Below my code and the histograms
# without pre-filtering for low-count genes
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design = ~ patient + condition)
ddsHTSeq <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ]
dds <- DESeq(ddsHTSeq, betaPrior = FALSE)
## 1 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest
ddsClean <- dds[which(mcols(dds)$betaConv),]
dds <- ddsClean
res <- results(dds, independentFiltering = TRUE, alpha = 0.05)
res <- lfcShrink(dds, coef="condition_1_vs_2", type="apeglm", res = res) # not p-value-relevant
summary(res)
out of 37440 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 4826, 13%
LFC < 0 (down) : 5104, 14%
outliers [1] : 0, 0%
low counts [2] : 13066, 35%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
hist(res$pvalue ,
col = "grey", border = "white", xlab = "", ylab = "",
main = "frequencies of p-values")
#-------------------
# with pre-filtering for low-count genes
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design = ~ patient + condition)
nrow(sampleTable)
68
ddsHTSeq <- ddsHTSeq[ (rowSums(counts(ddsHTSeq) >= 5 ) >= 7),] # 7 samples are 10% of my samples
dds <- DESeq(ddsHTSeq, betaPrior = FALSE)
## 5 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest
ddsClean <- dds[which(mcols(dds)$betaConv),]
dds <- ddsClean
res <- results(dds, independentFiltering = TRUE, alpha = 0.05)
res <- lfcShrink(dds, coef="condition_1_vs_2", type="apeglm", res = res) # not p-value-relevant
summary(res)
out of 22339 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 4856, 22%
LFC < 0 (down) : 5137, 23%
outliers [1] : 0, 0%
low counts [2] : 0, 0%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
hist(res$pvalue ,
col = "grey", border = "white", xlab = "", ylab = "",
main = "frequencies of p-values")
I appreciate every hint, thank you in advance!
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 LC_MONETARY=de_DE.UTF-8
[6] LC_MESSAGES=de_DE.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.18.1 SummarizedExperiment_1.8.1 DelayedArray_0.4.1 matrixStats_0.54.0 Biobase_2.38.0
[6] GenomicRanges_1.30.3 GenomeInfoDb_1.14.0 IRanges_2.12.0 S4Vectors_0.16.0 BiocGenerics_0.24.0
loaded via a namespace (and not attached):
[1] bit64_0.9-7 splines_3.4.4 Formula_1.2-3 assertthat_0.2.0 latticeExtra_0.6-28 blob_1.1.1
[7] GenomeInfoDbData_1.0.0 pillar_1.3.1 RSQLite_2.1.1 backports_1.1.2 lattice_0.20-35 glue_1.3.0
[13] digest_0.6.18 RColorBrewer_1.1-2 XVector_0.18.0 checkmate_1.8.5 colorspace_1.3-2 htmltools_0.3.6
[19] Matrix_1.2-14 plyr_1.8.4 XML_3.98-1.16 pkgconfig_2.0.2 genefilter_1.60.0 zlibbioc_1.24.0
[25] purrr_0.2.5 xtable_1.8-2 scales_1.0.0 BiocParallel_1.12.0 htmlTable_1.12 tibble_1.4.2
[31] annotate_1.56.2 ggplot2_3.1.0 nnet_7.3-12 lazyeval_0.2.1 survival_2.43-3 magrittr_1.5
[37] crayon_1.3.4 memoise_1.1.0 foreign_0.8-71 tools_3.4.4 data.table_1.11.4 stringr_1.3.1
[43] locfit_1.5-9.1 munsell_0.5.0 cluster_2.0.7-1 AnnotationDbi_1.40.0 bindrcpp_0.2.2 compiler_3.4.4
[49] rlang_0.3.0.1 grid_3.4.4 RCurl_1.95-4.11 rstudioapi_0.7 htmlwidgets_1.2 bitops_1.0-6
[55] base64enc_0.1-3 gtable_0.2.0 DBI_1.0.0 R6_2.2.2 gridExtra_2.3 knitr_1.20
[61] dplyr_0.7.8 bit_1.1-14 bindr_0.1.1 Hmisc_4.1-1 stringi_1.2.4 Rcpp_1.0.0
[67] geneplotter_1.56.0 rpart_4.1-13 acepack_1.4.1 tidyselect_0.2.5
Yes, this answers my question, thank you Michael! Sorry, I was not aware that independent filtering only has influence on the adjusted p-values.