U-shape of p-value histograms is removed by filtering for genes that have >5 counts in >10% of samples before applying DESeq2 analysis
1
0
Entering edit mode
stu111538 • 0
@stu111538-13994
Last seen 3.6 years ago
Germany, Kiel, University Hospital Kiel

Hi, I used DESeq2 for several differential expression analyses for paired patient samples (sample sizes in the range of 6 to 34 patients). The unadjusted p-values (with applying independentFiltering = TRUE) of the DESeq2 analyses (both Wald test and time series experiments using LRT) always showed a U-shaped distribution. I could not figure out what is wrong with my data or my applied tests. Now, I filtered out genes where there are less than 10% of samples with raw counts greater than or equal to 5 prior to DESeq2 analysis. The distribution of p-values is not U-shaped anymore. Without filtering for low-count genes I had extremely many genes yielding p-values close to 1. Now this extreme is resolved. My question: (1) There was one comment from Micheal Love (https://support.bioconductor.org/p/65256/) saying that there is no need for pre-filtering if one uses independentFiltering = TRUE, but my analysis included independentFiltering = TRUE. Why does pre-filtering remove most of the 'close-to-1' p-values and change the shape of the p-value distribution in my analysis? (2) Pre-filtering for low-count genes seems to have two advantages for me: resolving the alarming U-shaped p-value distribution and slightly more significant genes, because multiple testing correction for fewer genes. Are there disadvantages of pre-filtering? Could this produce bias? What is troubling more: pre-filtering or U-shaped p-value distribution? (3) Would you recommend to apply the pre-filtering on the normalized counts instead of raw counts?

Below my code and the histograms

# without pre-filtering for low-count genes

ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design = ~ patient + condition)
ddsHTSeq <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ]
dds <- DESeq(ddsHTSeq, betaPrior = FALSE)

## 1 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest

ddsClean <- dds[which(mcols(dds)$betaConv),]
dds <- ddsClean
res <- results(dds, independentFiltering = TRUE, alpha = 0.05)
res <- lfcShrink(dds, coef="condition_1_vs_2", type="apeglm", res = res) # not p-value-relevant
summary(res)

out of 37440 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 4826, 13% 
LFC < 0 (down)   : 5104, 14% 
outliers [1]     : 0, 0% 
low counts [2]   : 13066, 35% 
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

hist(res$pvalue ,
     col = "grey", border = "white", xlab = "", ylab = "",
     main = "frequencies of p-values")

p-values for test without pre-filtering

#-------------------

# with pre-filtering for low-count genes

ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design = ~ patient + condition)
nrow(sampleTable)

68

ddsHTSeq <- ddsHTSeq[ (rowSums(counts(ddsHTSeq) >= 5 ) >= 7),] # 7 samples are 10% of my samples
dds <- DESeq(ddsHTSeq, betaPrior = FALSE)

## 5 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest

ddsClean <- dds[which(mcols(dds)$betaConv),]
dds <- ddsClean
res <- results(dds, independentFiltering = TRUE, alpha = 0.05)
res <- lfcShrink(dds, coef="condition_1_vs_2", type="apeglm", res = res) # not p-value-relevant
summary(res)

out of 22339 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 4856, 22% 
LFC < 0 (down)   : 5137, 23% 
outliers [1]     : 0, 0% 
low counts [2]   : 0, 0% 
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

hist(res$pvalue ,
     col = "grey", border = "white", xlab = "", ylab = "",
     main = "frequencies of p-values")

p-values for test with pre-filtering

I appreciate every hint, thank you in advance!

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8   
 [6] LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.18.1              SummarizedExperiment_1.8.1 DelayedArray_0.4.1         matrixStats_0.54.0         Biobase_2.38.0            
 [6] GenomicRanges_1.30.3       GenomeInfoDb_1.14.0        IRanges_2.12.0             S4Vectors_0.16.0           BiocGenerics_0.24.0       

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.4.4          Formula_1.2-3          assertthat_0.2.0       latticeExtra_0.6-28    blob_1.1.1            
 [7] GenomeInfoDbData_1.0.0 pillar_1.3.1           RSQLite_2.1.1          backports_1.1.2        lattice_0.20-35        glue_1.3.0            
[13] digest_0.6.18          RColorBrewer_1.1-2     XVector_0.18.0         checkmate_1.8.5        colorspace_1.3-2       htmltools_0.3.6       
[19] Matrix_1.2-14          plyr_1.8.4             XML_3.98-1.16          pkgconfig_2.0.2        genefilter_1.60.0      zlibbioc_1.24.0       
[25] purrr_0.2.5            xtable_1.8-2           scales_1.0.0           BiocParallel_1.12.0    htmlTable_1.12         tibble_1.4.2          
[31] annotate_1.56.2        ggplot2_3.1.0          nnet_7.3-12            lazyeval_0.2.1         survival_2.43-3        magrittr_1.5          
[37] crayon_1.3.4           memoise_1.1.0          foreign_0.8-71         tools_3.4.4            data.table_1.11.4      stringr_1.3.1         
[43] locfit_1.5-9.1         munsell_0.5.0          cluster_2.0.7-1        AnnotationDbi_1.40.0   bindrcpp_0.2.2         compiler_3.4.4        
[49] rlang_0.3.0.1          grid_3.4.4             RCurl_1.95-4.11        rstudioapi_0.7         htmlwidgets_1.2        bitops_1.0-6          
[55] base64enc_0.1-3        gtable_0.2.0           DBI_1.0.0              R6_2.2.2               gridExtra_2.3          knitr_1.20            
[61] dplyr_0.7.8            bit_1.1-14             bindr_0.1.1            Hmisc_4.1-1            stringi_1.2.4          Rcpp_1.0.0            
[67] geneplotter_1.56.0     rpart_4.1-13           acepack_1.4.1          tidyselect_0.2.5
deseq2 p-value distribution U-shaped pre-filter low counts • 2.1k views
ADD COMMENT
0
Entering edit mode

Yes, this answers my question, thank you Michael! Sorry, I was not aware that independent filtering only has influence on the adjusted p-values.

ADD REPLY
1
Entering edit mode
@mikelove
Last seen 20 hours ago
United States

The independent filtering does not remove p-values from your results table, but it excludes those genes with low mean from the multiple test correction. They will get an NA for the adjusted p-value (see vignette). Does this answer your question?

The p-value histogram isn't alarming or troubling to me. But yes, I typically would remove the very low count genes to look at a histogram of p-values. But beyond visual inspection of that histogram, it's not required.

ADD COMMENT
0
Entering edit mode

Hi Michael,

I have a question regarding pre-filtering that is why I am piggy backing on OP's question. In my case if I do not pre-filter I get a large number of DEGs but while looking closely most of them are coming from very very low count genes. Please see attached image for the histogram of basemean of the DEGs (most of them are bellow 0.5). What is your suggestion regarding these genes and the validity of their significance?

Thanks in advance. Nurun

#without pre-filtering

dds <- DESeqDataSetFromMatrix(countData = Count.selected,
                              colData = SampleData.selected, 
                              design = ~RNA_Batch+RIN+Sex+Disease_Status)

dds <- estimateSizeFactors(dds)
dds <- DESeq(dds)
resultsNames(dds)
res.no.pre-filter <- results(dds, contrast=c("Disease_Status","HI_MS","CTRL"), independentFiltering = TRUE)
summary(res.no.pre-filter)

out of 48599 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 151, 0.31%
LFC < 0 (down)     : 108, 0.22%
outliers [1]       : 0, 0%
low counts [2]     : 895, 1.8%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

https://www.dropbox.com/s/huyank6lpui0aqg/hist.basemean.png?dl=0

#with Pre-filtering

dds <- DESeqDataSetFromMatrix(countData = Count.selected,
                              colData = SampleData.selected, 
                              design = ~RNA_Batch+RIN+Sex+Disease_Status)
dds <- estimateSizeFactors(dds)
idx <- rowSums( counts(dds, normalized=TRUE) > 0) >= 5
dds <- dds[idx,]
dds <- DESeq(dds)
res.pre-filter <- results(dds, contrast=c("Disease_Status","HI_MS","CTRL"), independentFiltering = TRUE)
summary(res.pre-filter)

out of 37443 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 48, 0.13%
LFC < 0 (down)     : 12, 0.032%
outliers [1]       : 0, 0%
low counts [2]     : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

https://www.dropbox.com/s/cdskoqshsq5vm6t/hist.filtered.basemean.png?dl=0

ADD REPLY
0
Entering edit mode

Can you post a plotCounts of a gene that you think shouldn't be called DE? Low average count doesn't tell me much. If you have counts of [10, 10, 10] vs another group with no expression that will and should give a DE call, despite having a low average count.

ADD REPLY
0
Entering edit mode

For example:


>res["ORM1", ]

log2 fold change (MLE): Disease_Status HI_MS vs CTRL 
Wald test p-value: Disease Status HI MS vs CTRL 
DataFrame with 1 row and 6 columns
              baseMean   log2FoldChange            lfcSE             stat               pvalue                 padj
             <numeric>        <numeric>        <numeric>        <numeric>            <numeric>            <numeric>
ORM1 0.197299466348842 29.2902036475426 4.02601257611984 7.27523898491438 3.45807707055826e-13 2.42594277314576e-10

>plotCounts(dds, gene="ORM1", intgroup="Disease_Status")

https://www.dropbox.com/s/hp32t345xp2hpzl/ORM1.png?dl=0

Only one sample in HI_MS group shows an up-regulation and it is called as a DE.

ADD REPLY
0
Entering edit mode

One aspect is that this would likely be filtered for Cook's distance outliers except filtering here isn't possible because of the continuous covariate you're using. I don't typically use RIN in the design formula, I prefer using a method like RUV or SVA (see workflow for code). But can you try again with ~batch + sex + disease?

That or I'd recommend to use a filter for at least X samples with a count >= 10, e.g. 3 or 5 samples.

ADD REPLY
0
Entering edit mode

Thank you for clarification.

ADD REPLY

Login before adding your answer.

Traffic: 794 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6