Question: U-shape of p-value histograms is removed by filtering for genes that have >5 counts in >10% of samples before applying DESeq2 analysis
0
gravatar for stu111538
8 months ago by
stu1115380
Germany, Kiel, University Hospital Kiel
stu1115380 wrote:

Hi, I used DESeq2 for several differential expression analyses for paired patient samples (sample sizes in the range of 6 to 34 patients). The unadjusted p-values (with applying independentFiltering = TRUE) of the DESeq2 analyses (both Wald test and time series experiments using LRT) always showed a U-shaped distribution. I could not figure out what is wrong with my data or my applied tests. Now, I filtered out genes where there are less than 10% of samples with raw counts greater than or equal to 5 prior to DESeq2 analysis. The distribution of p-values is not U-shaped anymore. Without filtering for low-count genes I had extremely many genes yielding p-values close to 1. Now this extreme is resolved. My question: (1) There was one comment from Micheal Love (https://support.bioconductor.org/p/65256/) saying that there is no need for pre-filtering if one uses independentFiltering = TRUE, but my analysis included independentFiltering = TRUE. Why does pre-filtering remove most of the 'close-to-1' p-values and change the shape of the p-value distribution in my analysis? (2) Pre-filtering for low-count genes seems to have two advantages for me: resolving the alarming U-shaped p-value distribution and slightly more significant genes, because multiple testing correction for fewer genes. Are there disadvantages of pre-filtering? Could this produce bias? What is troubling more: pre-filtering or U-shaped p-value distribution? (3) Would you recommend to apply the pre-filtering on the normalized counts instead of raw counts?

Below my code and the histograms

# without pre-filtering for low-count genes

ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design = ~ patient + condition)
ddsHTSeq <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ]
dds <- DESeq(ddsHTSeq, betaPrior = FALSE)

## 1 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest

ddsClean <- dds[which(mcols(dds)$betaConv),]
dds <- ddsClean
res <- results(dds, independentFiltering = TRUE, alpha = 0.05)
res <- lfcShrink(dds, coef="condition_1_vs_2", type="apeglm", res = res) # not p-value-relevant
summary(res)

out of 37440 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 4826, 13% 
LFC < 0 (down)   : 5104, 14% 
outliers [1]     : 0, 0% 
low counts [2]   : 13066, 35% 
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

hist(res$pvalue ,
     col = "grey", border = "white", xlab = "", ylab = "",
     main = "frequencies of p-values")

p-values for test without pre-filtering

#-------------------

# with pre-filtering for low-count genes

ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design = ~ patient + condition)
nrow(sampleTable)

68

ddsHTSeq <- ddsHTSeq[ (rowSums(counts(ddsHTSeq) >= 5 ) >= 7),] # 7 samples are 10% of my samples
dds <- DESeq(ddsHTSeq, betaPrior = FALSE)

## 5 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest

ddsClean <- dds[which(mcols(dds)$betaConv),]
dds <- ddsClean
res <- results(dds, independentFiltering = TRUE, alpha = 0.05)
res <- lfcShrink(dds, coef="condition_1_vs_2", type="apeglm", res = res) # not p-value-relevant
summary(res)

out of 22339 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 4856, 22% 
LFC < 0 (down)   : 5137, 23% 
outliers [1]     : 0, 0% 
low counts [2]   : 0, 0% 
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

hist(res$pvalue ,
     col = "grey", border = "white", xlab = "", ylab = "",
     main = "frequencies of p-values")

p-values for test with pre-filtering

I appreciate every hint, thank you in advance!

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8   
 [6] LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.18.1              SummarizedExperiment_1.8.1 DelayedArray_0.4.1         matrixStats_0.54.0         Biobase_2.38.0            
 [6] GenomicRanges_1.30.3       GenomeInfoDb_1.14.0        IRanges_2.12.0             S4Vectors_0.16.0           BiocGenerics_0.24.0       

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.4.4          Formula_1.2-3          assertthat_0.2.0       latticeExtra_0.6-28    blob_1.1.1            
 [7] GenomeInfoDbData_1.0.0 pillar_1.3.1           RSQLite_2.1.1          backports_1.1.2        lattice_0.20-35        glue_1.3.0            
[13] digest_0.6.18          RColorBrewer_1.1-2     XVector_0.18.0         checkmate_1.8.5        colorspace_1.3-2       htmltools_0.3.6       
[19] Matrix_1.2-14          plyr_1.8.4             XML_3.98-1.16          pkgconfig_2.0.2        genefilter_1.60.0      zlibbioc_1.24.0       
[25] purrr_0.2.5            xtable_1.8-2           scales_1.0.0           BiocParallel_1.12.0    htmlTable_1.12         tibble_1.4.2          
[31] annotate_1.56.2        ggplot2_3.1.0          nnet_7.3-12            lazyeval_0.2.1         survival_2.43-3        magrittr_1.5          
[37] crayon_1.3.4           memoise_1.1.0          foreign_0.8-71         tools_3.4.4            data.table_1.11.4      stringr_1.3.1         
[43] locfit_1.5-9.1         munsell_0.5.0          cluster_2.0.7-1        AnnotationDbi_1.40.0   bindrcpp_0.2.2         compiler_3.4.4        
[49] rlang_0.3.0.1          grid_3.4.4             RCurl_1.95-4.11        rstudioapi_0.7         htmlwidgets_1.2        bitops_1.0-6          
[55] base64enc_0.1-3        gtable_0.2.0           DBI_1.0.0              R6_2.2.2               gridExtra_2.3          knitr_1.20            
[61] dplyr_0.7.8            bit_1.1-14             bindr_0.1.1            Hmisc_4.1-1            stringi_1.2.4          Rcpp_1.0.0            
[67] geneplotter_1.56.0     rpart_4.1-13           acepack_1.4.1          tidyselect_0.2.5
ADD COMMENTlink modified 8 months ago • written 8 months ago by stu1115380

Yes, this answers my question, thank you Michael! Sorry, I was not aware that independent filtering only has influence on the adjusted p-values.

ADD REPLYlink written 8 months ago by stu1115380
Answer: U-shape of p-value histograms is removed by filtering for genes that have >5 cou
1
gravatar for Michael Love
8 months ago by
Michael Love25k
United States
Michael Love25k wrote:

The independent filtering does not remove p-values from your results table, but it excludes those genes with low mean from the multiple test correction. They will get an NA for the adjusted p-value (see vignette). Does this answer your question?

The p-value histogram isn't alarming or troubling to me. But yes, I typically would remove the very low count genes to look at a histogram of p-values. But beyond visual inspection of that histogram, it's not required.

ADD COMMENTlink written 8 months ago by Michael Love25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 148 users visited in the last hour