Few significance in DESeq2
1
0
Entering edit mode
bioinfo • 0
@bioinfo-12782
Last seen 20 months ago
United States

Hi I am comparing RNA-seq between two conditions (WT vs. KO) using DESeq2. I have two different distribution of p-values depending on the cut-off for genes based on counting and application of fdrtool. I would like to listen to your advice for which is a correct way to go.

I used the following code to generate figures:

> dds<-dds[rowSums(counts(dds))>=count.th,]
> dds<-DESeq(dds)
> res<-results(dds,contrast = c("Group","KO","WT"))
> hist(res$pvalue) # for figure A and C

> res.emp <- res
> res.emp <- res.emp[ !is.na(res.emp$pvalue), ]
> emp.pval <- fdrtool(res.emp$stat, statistic= "normal", plot = T)
> hist(emp.pval$pval) # figure B and D

In the attached figure (link), A and B were generated by count.th =1 with A before fdrtool, B after fdrtool. C and D were generated by count.th=10, with C before fdrtool, D after fdrtool. 

A and B produced some number (~10) of significant genes (padj<0.01) while C and D produced no significant genes (padj<0.01).

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
 [1] fdrtool_1.2.15             DESeq2_1.14.1             
 [3] SummarizedExperiment_1.4.0 Biobase_2.34.0            
 [5] GenomicRanges_1.26.4       GenomeInfoDb_1.10.3       
 [7] IRanges_2.8.2              S4Vectors_0.12.2          
 [9] BiocGenerics_0.20.0        reshape2_1.4.2            
[11] pheatmap_1.0.8             ggplot2_2.2.1             

loaded via a namespace (and not attached):
 [1] genefilter_1.56.0    locfit_1.5-9.1       splines_3.3.3       
 [4] lattice_0.20-34      colorspace_1.3-2     htmltools_0.3.6     
 [7] base64enc_0.1-3      blob_1.1.0           survival_2.40-1     
[10] XML_3.98-1.9         rlang_0.1.2          foreign_0.8-67      
[13] DBI_0.7              BiocParallel_1.8.2   bit64_0.9-7         
[16] RColorBrewer_1.1-2   plyr_1.8.4           stringr_1.2.0       
[19] zlibbioc_1.20.0      munsell_0.4.3        gtable_0.2.0        
[22] htmlwidgets_0.9      memoise_1.1.0        latticeExtra_0.6-28 
[25] knitr_1.17           geneplotter_1.52.0   AnnotationDbi_1.36.2
[28] htmlTable_1.9        Rcpp_0.12.13         acepack_1.4.1       
[31] xtable_1.8-2         scales_0.5.0         backports_1.1.1     
[34] checkmate_1.8.5      Hmisc_4.0-3          annotate_1.52.1     
[37] XVector_0.14.1       bit_1.1-12           gridExtra_2.3       
[40] digest_0.6.12        stringi_1.1.5        grid_3.3.3          
[43] tools_3.3.3          bitops_1.0-6         magrittr_1.5        
[46] lazyeval_0.2.1       RCurl_1.95-4.8       tibble_1.3.4        
[49] RSQLite_2.0          Formula_1.2-2        cluster_2.0.5       
[52] Matrix_1.2-8         data.table_1.10.4-3  rpart_4.1-10        
[55] nnet_7.3-12         

 

 

 

deseq2 • 846 views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 12 minutes ago
United States

The spikes in the histrogram in A and B are from genes where you have e.g. 1 count across all samples.

If you are going to assess the p-value histogram, I'd recommend to use a stronger filter, e.g. 

keep <- rowSums(counts(dds, normalized=TRUE) >= 10) >= n

..., that is 'n' or more samples with a normalized count of 10 or more.

I don't in general recommend altering the p-values to create more significant hits. What does the PCA plot for this experiment look like? The data may be telling you that the differences between groups are not larger than the biological variability. That should be the default assumption when there are few hits.

ADD COMMENT
0
Entering edit mode

Hi Michael,

Thanks for your answer. You are right. In PCA plot, the samples are intermingled regardless of conditions. I understand that there are few significant genes in this case. However, while I expect that the distribution of p-value is almost flat in the case, I currently have some depletion around 0 in the plot of p-value distribution as you see figure (c) and (d). Could you explain me how to understand this?

ADD REPLY
0
Entering edit mode

It could be from unaccounted-for heterogeneity, and there are packages that help you estimate surrogate variables or factors of unwanted variation.

In general, I don't like post-processing the p-values to induce some differential results unless there's a clear case for it, and that p-value histogram isn't a good case for it.

ADD REPLY
0
Entering edit mode

Thanks for your help. I agree with you. I don't want to do post-processing the p-values to have significance. I can conclude that they do not have significant genes. But it is still unclear that why the distribution of p-value is not flat in the case of no significant gene. If I understand you correctly, if we have unaccounted heterogeneity, it is possible for the distribution of p-values to have depletion near zero. Could you confirm my understanding?

ADD REPLY
0
Entering edit mode

Yes this can induce non uniform null.

ADD REPLY

Login before adding your answer.

Traffic: 1062 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6