Few significance in DESeq2
Entering edit mode
bioinfo • 0
Last seen 3 months ago
United States

Hi I am comparing RNA-seq between two conditions (WT vs. KO) using DESeq2. I have two different distribution of p-values depending on the cut-off for genes based on counting and application of fdrtool. I would like to listen to your advice for which is a correct way to go.

I used the following code to generate figures:

> dds<-dds[rowSums(counts(dds))>=count.th,]
> dds<-DESeq(dds)
> res<-results(dds,contrast = c("Group","KO","WT"))
> hist(res$pvalue) # for figure A and C

> res.emp <- res
> res.emp <- res.emp[ !is.na(res.emp$pvalue), ]
> emp.pval <- fdrtool(res.emp$stat, statistic= "normal", plot = T)
> hist(emp.pval$pval) # figure B and D

In the attached figure (link), A and B were generated by count.th =1 with A before fdrtool, B after fdrtool. C and D were generated by count.th=10, with C before fdrtool, D after fdrtool. 

A and B produced some number (~10) of significant genes (padj<0.01) while C and D produced no significant genes (padj<0.01).

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.6

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
 [1] fdrtool_1.2.15             DESeq2_1.14.1             
 [3] SummarizedExperiment_1.4.0 Biobase_2.34.0            
 [5] GenomicRanges_1.26.4       GenomeInfoDb_1.10.3       
 [7] IRanges_2.8.2              S4Vectors_0.12.2          
 [9] BiocGenerics_0.20.0        reshape2_1.4.2            
[11] pheatmap_1.0.8             ggplot2_2.2.1             

loaded via a namespace (and not attached):
 [1] genefilter_1.56.0    locfit_1.5-9.1       splines_3.3.3       
 [4] lattice_0.20-34      colorspace_1.3-2     htmltools_0.3.6     
 [7] base64enc_0.1-3      blob_1.1.0           survival_2.40-1     
[10] XML_3.98-1.9         rlang_0.1.2          foreign_0.8-67      
[13] DBI_0.7              BiocParallel_1.8.2   bit64_0.9-7         
[16] RColorBrewer_1.1-2   plyr_1.8.4           stringr_1.2.0       
[19] zlibbioc_1.20.0      munsell_0.4.3        gtable_0.2.0        
[22] htmlwidgets_0.9      memoise_1.1.0        latticeExtra_0.6-28 
[25] knitr_1.17           geneplotter_1.52.0   AnnotationDbi_1.36.2
[28] htmlTable_1.9        Rcpp_0.12.13         acepack_1.4.1       
[31] xtable_1.8-2         scales_0.5.0         backports_1.1.1     
[34] checkmate_1.8.5      Hmisc_4.0-3          annotate_1.52.1     
[37] XVector_0.14.1       bit_1.1-12           gridExtra_2.3       
[40] digest_0.6.12        stringi_1.1.5        grid_3.3.3          
[43] tools_3.3.3          bitops_1.0-6         magrittr_1.5        
[46] lazyeval_0.2.1       RCurl_1.95-4.8       tibble_1.3.4        
[49] RSQLite_2.0          Formula_1.2-2        cluster_2.0.5       
[52] Matrix_1.2-8         data.table_1.10.4-3  rpart_4.1-10        
[55] nnet_7.3-12         




deseq2 • 591 views
Entering edit mode
Last seen 3 days ago
United States

The spikes in the histrogram in A and B are from genes where you have e.g. 1 count across all samples.

If you are going to assess the p-value histogram, I'd recommend to use a stronger filter, e.g. 

keep <- rowSums(counts(dds, normalized=TRUE) >= 10) >= n

..., that is 'n' or more samples with a normalized count of 10 or more.

I don't in general recommend altering the p-values to create more significant hits. What does the PCA plot for this experiment look like? The data may be telling you that the differences between groups are not larger than the biological variability. That should be the default assumption when there are few hits.

Entering edit mode

Hi Michael,

Thanks for your answer. You are right. In PCA plot, the samples are intermingled regardless of conditions. I understand that there are few significant genes in this case. However, while I expect that the distribution of p-value is almost flat in the case, I currently have some depletion around 0 in the plot of p-value distribution as you see figure (c) and (d). Could you explain me how to understand this?

Entering edit mode

It could be from unaccounted-for heterogeneity, and there are packages that help you estimate surrogate variables or factors of unwanted variation.

In general, I don't like post-processing the p-values to induce some differential results unless there's a clear case for it, and that p-value histogram isn't a good case for it.

Entering edit mode

Thanks for your help. I agree with you. I don't want to do post-processing the p-values to have significance. I can conclude that they do not have significant genes. But it is still unclear that why the distribution of p-value is not flat in the case of no significant gene. If I understand you correctly, if we have unaccounted heterogeneity, it is possible for the distribution of p-values to have depletion near zero. Could you confirm my understanding?

Entering edit mode

Yes this can induce non uniform null.


Login before adding your answer.

Traffic: 219 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6