Question: U-shape of p-value histograms is removed by filtering for genes that have >5 counts in >10% of samples before applying DESeq2 analysis
0
gravatar for stu111538
11 months ago by
stu1115380
Germany, Kiel, University Hospital Kiel
stu1115380 wrote:

Hi, I used DESeq2 for several differential expression analyses for paired patient samples (sample sizes in the range of 6 to 34 patients). The unadjusted p-values (with applying independentFiltering = TRUE) of the DESeq2 analyses (both Wald test and time series experiments using LRT) always showed a U-shaped distribution. I could not figure out what is wrong with my data or my applied tests. Now, I filtered out genes where there are less than 10% of samples with raw counts greater than or equal to 5 prior to DESeq2 analysis. The distribution of p-values is not U-shaped anymore. Without filtering for low-count genes I had extremely many genes yielding p-values close to 1. Now this extreme is resolved. My question: (1) There was one comment from Micheal Love (https://support.bioconductor.org/p/65256/) saying that there is no need for pre-filtering if one uses independentFiltering = TRUE, but my analysis included independentFiltering = TRUE. Why does pre-filtering remove most of the 'close-to-1' p-values and change the shape of the p-value distribution in my analysis? (2) Pre-filtering for low-count genes seems to have two advantages for me: resolving the alarming U-shaped p-value distribution and slightly more significant genes, because multiple testing correction for fewer genes. Are there disadvantages of pre-filtering? Could this produce bias? What is troubling more: pre-filtering or U-shaped p-value distribution? (3) Would you recommend to apply the pre-filtering on the normalized counts instead of raw counts?

Below my code and the histograms

# without pre-filtering for low-count genes

ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design = ~ patient + condition)
ddsHTSeq <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ]
dds <- DESeq(ddsHTSeq, betaPrior = FALSE)

## 1 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest

ddsClean <- dds[which(mcols(dds)$betaConv),]
dds <- ddsClean
res <- results(dds, independentFiltering = TRUE, alpha = 0.05)
res <- lfcShrink(dds, coef="condition_1_vs_2", type="apeglm", res = res) # not p-value-relevant
summary(res)

out of 37440 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 4826, 13% 
LFC < 0 (down)   : 5104, 14% 
outliers [1]     : 0, 0% 
low counts [2]   : 13066, 35% 
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

hist(res$pvalue ,
     col = "grey", border = "white", xlab = "", ylab = "",
     main = "frequencies of p-values")

p-values for test without pre-filtering

#-------------------

# with pre-filtering for low-count genes

ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design = ~ patient + condition)
nrow(sampleTable)

68

ddsHTSeq <- ddsHTSeq[ (rowSums(counts(ddsHTSeq) >= 5 ) >= 7),] # 7 samples are 10% of my samples
dds <- DESeq(ddsHTSeq, betaPrior = FALSE)

## 5 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest

ddsClean <- dds[which(mcols(dds)$betaConv),]
dds <- ddsClean
res <- results(dds, independentFiltering = TRUE, alpha = 0.05)
res <- lfcShrink(dds, coef="condition_1_vs_2", type="apeglm", res = res) # not p-value-relevant
summary(res)

out of 22339 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 4856, 22% 
LFC < 0 (down)   : 5137, 23% 
outliers [1]     : 0, 0% 
low counts [2]   : 0, 0% 
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

hist(res$pvalue ,
     col = "grey", border = "white", xlab = "", ylab = "",
     main = "frequencies of p-values")

p-values for test with pre-filtering

I appreciate every hint, thank you in advance!

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8   
 [6] LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.18.1              SummarizedExperiment_1.8.1 DelayedArray_0.4.1         matrixStats_0.54.0         Biobase_2.38.0            
 [6] GenomicRanges_1.30.3       GenomeInfoDb_1.14.0        IRanges_2.12.0             S4Vectors_0.16.0           BiocGenerics_0.24.0       

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.4.4          Formula_1.2-3          assertthat_0.2.0       latticeExtra_0.6-28    blob_1.1.1            
 [7] GenomeInfoDbData_1.0.0 pillar_1.3.1           RSQLite_2.1.1          backports_1.1.2        lattice_0.20-35        glue_1.3.0            
[13] digest_0.6.18          RColorBrewer_1.1-2     XVector_0.18.0         checkmate_1.8.5        colorspace_1.3-2       htmltools_0.3.6       
[19] Matrix_1.2-14          plyr_1.8.4             XML_3.98-1.16          pkgconfig_2.0.2        genefilter_1.60.0      zlibbioc_1.24.0       
[25] purrr_0.2.5            xtable_1.8-2           scales_1.0.0           BiocParallel_1.12.0    htmlTable_1.12         tibble_1.4.2          
[31] annotate_1.56.2        ggplot2_3.1.0          nnet_7.3-12            lazyeval_0.2.1         survival_2.43-3        magrittr_1.5          
[37] crayon_1.3.4           memoise_1.1.0          foreign_0.8-71         tools_3.4.4            data.table_1.11.4      stringr_1.3.1         
[43] locfit_1.5-9.1         munsell_0.5.0          cluster_2.0.7-1        AnnotationDbi_1.40.0   bindrcpp_0.2.2         compiler_3.4.4        
[49] rlang_0.3.0.1          grid_3.4.4             RCurl_1.95-4.11        rstudioapi_0.7         htmlwidgets_1.2        bitops_1.0-6          
[55] base64enc_0.1-3        gtable_0.2.0           DBI_1.0.0              R6_2.2.2               gridExtra_2.3          knitr_1.20            
[61] dplyr_0.7.8            bit_1.1-14             bindr_0.1.1            Hmisc_4.1-1            stringi_1.2.4          Rcpp_1.0.0            
[67] geneplotter_1.56.0     rpart_4.1-13           acepack_1.4.1          tidyselect_0.2.5
ADD COMMENTlink modified 11 months ago • written 11 months ago by stu1115380

Yes, this answers my question, thank you Michael! Sorry, I was not aware that independent filtering only has influence on the adjusted p-values.

ADD REPLYlink written 11 months ago by stu1115380
Answer: U-shape of p-value histograms is removed by filtering for genes that have >5 cou
1
gravatar for Michael Love
11 months ago by
Michael Love26k
United States
Michael Love26k wrote:

The independent filtering does not remove p-values from your results table, but it excludes those genes with low mean from the multiple test correction. They will get an NA for the adjusted p-value (see vignette). Does this answer your question?

The p-value histogram isn't alarming or troubling to me. But yes, I typically would remove the very low count genes to look at a histogram of p-values. But beyond visual inspection of that histogram, it's not required.

ADD COMMENTlink written 11 months ago by Michael Love26k

Hi Michael,

I have a question regarding pre-filtering that is why I am piggy backing on OP's question. In my case if I do not pre-filter I get a large number of DEGs but while looking closely most of them are coming from very very low count genes. Please see attached image for the histogram of basemean of the DEGs (most of them are bellow 0.5). What is your suggestion regarding these genes and the validity of their significance?

Thanks in advance. Nurun

#without pre-filtering

dds <- DESeqDataSetFromMatrix(countData = Count.selected,
                              colData = SampleData.selected, 
                              design = ~RNA_Batch+RIN+Sex+Disease_Status)

dds <- estimateSizeFactors(dds)
dds <- DESeq(dds)
resultsNames(dds)
res.no.pre-filter <- results(dds, contrast=c("Disease_Status","HI_MS","CTRL"), independentFiltering = TRUE)
summary(res.no.pre-filter)

out of 48599 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 151, 0.31%
LFC < 0 (down)     : 108, 0.22%
outliers [1]       : 0, 0%
low counts [2]     : 895, 1.8%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

https://www.dropbox.com/s/huyank6lpui0aqg/hist.basemean.png?dl=0

#with Pre-filtering

dds <- DESeqDataSetFromMatrix(countData = Count.selected,
                              colData = SampleData.selected, 
                              design = ~RNA_Batch+RIN+Sex+Disease_Status)
dds <- estimateSizeFactors(dds)
idx <- rowSums( counts(dds, normalized=TRUE) > 0) >= 5
dds <- dds[idx,]
dds <- DESeq(dds)
res.pre-filter <- results(dds, contrast=c("Disease_Status","HI_MS","CTRL"), independentFiltering = TRUE)
summary(res.pre-filter)

out of 37443 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 48, 0.13%
LFC < 0 (down)     : 12, 0.032%
outliers [1]       : 0, 0%
low counts [2]     : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

https://www.dropbox.com/s/cdskoqshsq5vm6t/hist.filtered.basemean.png?dl=0

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by n.naharfancy0

Can you post a plotCounts of a gene that you think shouldn't be called DE? Low average count doesn't tell me much. If you have counts of [10, 10, 10] vs another group with no expression that will and should give a DE call, despite having a low average count.

ADD REPLYlink written 5 weeks ago by Michael Love26k

For example:


>res["ORM1", ]

log2 fold change (MLE): Disease_Status HI_MS vs CTRL 
Wald test p-value: Disease Status HI MS vs CTRL 
DataFrame with 1 row and 6 columns
              baseMean   log2FoldChange            lfcSE             stat               pvalue                 padj
             <numeric>        <numeric>        <numeric>        <numeric>            <numeric>            <numeric>
ORM1 0.197299466348842 29.2902036475426 4.02601257611984 7.27523898491438 3.45807707055826e-13 2.42594277314576e-10

>plotCounts(dds, gene="ORM1", intgroup="Disease_Status")

https://www.dropbox.com/s/hp32t345xp2hpzl/ORM1.png?dl=0

Only one sample in HI_MS group shows an up-regulation and it is called as a DE.

ADD REPLYlink written 4 weeks ago by n.naharfancy0

One aspect is that this would likely be filtered for Cook's distance outliers except filtering here isn't possible because of the continuous covariate you're using. I don't typically use RIN in the design formula, I prefer using a method like RUV or SVA (see workflow for code). But can you try again with ~batch + sex + disease?

That or I'd recommend to use a filter for at least X samples with a count >= 10, e.g. 3 or 5 samples.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Michael Love26k

Thank you for clarification.

ADD REPLYlink written 4 weeks ago by n.naharfancy0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 364 users visited in the last hour