Question

Few significance in DESeq2

bioinfo • 0

@bioinfo-12782

Last seen 20 months ago

United States

Hi, I am comparing RNA-seq data between two conditions (WT vs. KO) using DESeq2. I get two different p-value distributions depending on the count cut-off used to filter genes and on whether fdrtool is applied. I would appreciate your advice on which approach is correct.

I used the following code to generate figures:

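> library(DESeq2)
> library(fdrtool)
> # drop genes whose total count across all samples is below count.th
> dds <- dds[rowSums(counts(dds)) >= count.th, ]
> dds <- DESeq(dds)
> res <- results(dds, contrast = c("Group", "KO", "WT"))
> hist(res$pvalue) # for figures A and C

> # re-estimate the null from the test statistics with fdrtool
> res.emp <- res[!is.na(res$pvalue), ]
> emp.pval <- fdrtool(res.emp$stat, statistic = "normal", plot = TRUE)
> hist(emp.pval$pval) # for figures B and D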

In the attached figure (link), panels A and B were generated with count.th = 1 (A before fdrtool, B after fdrtool). Panels C and D were generated with count.th = 10 (C before fdrtool, D after fdrtool).

A and B gave about 10 significant genes (padj < 0.01), while C and D gave no significant genes (padj < 0.01).

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats4 stats graphics grDevices utils
[7] datasets methods base

other attached packages:
[1] fdrtool_1.2.15 DESeq2_1.14.1
[3] SummarizedExperiment_1.4.0 Biobase_2.34.0
[5] GenomicRanges_1.26.4 GenomeInfoDb_1.10.3
[7] IRanges_2.8.2 S4Vectors_0.12.2
[9] BiocGenerics_0.20.0 reshape2_1.4.2
[11] pheatmap_1.0.8 ggplot2_2.2.1

loaded via a namespace (and not attached):
[1] genefilter_1.56.0 locfit_1.5-9.1 splines_3.3.3
[4] lattice_0.20-34 colorspace_1.3-2 htmltools_0.3.6
[7] base64enc_0.1-3 blob_1.1.0 survival_2.40-1
[10] XML_3.98-1.9 rlang_0.1.2 foreign_0.8-67
[13] DBI_0.7 BiocParallel_1.8.2 bit64_0.9-7
[16] RColorBrewer_1.1-2 plyr_1.8.4 stringr_1.2.0
[19] zlibbioc_1.20.0 munsell_0.4.3 gtable_0.2.0
[22] htmlwidgets_0.9 memoise_1.1.0 latticeExtra_0.6-28
[25] knitr_1.17 geneplotter_1.52.0 AnnotationDbi_1.36.2
[28] htmlTable_1.9 Rcpp_0.12.13 acepack_1.4.1
[31] xtable_1.8-2 scales_0.5.0 backports_1.1.1
[34] checkmate_1.8.5 Hmisc_4.0-3 annotate_1.52.1
[37] XVector_0.14.1 bit_1.1-12 gridExtra_2.3
[40] digest_0.6.12 stringi_1.1.5 grid_3.3.3
[43] tools_3.3.3 bitops_1.0-6 magrittr_1.5
[46] lazyeval_0.2.1 RCurl_1.95-4.8 tibble_1.3.4
[49] RSQLite_2.0 Formula_1.2-2 cluster_2.0.5
[52] Matrix_1.2-8 data.table_1.10.4-3 rpart_4.1-10
[55] nnet_7.3-12

deseq2 • 846 views

updated 6.0 years ago by Michael Love • written 6.0 years ago by bioinfo

score 0 · Answer 1 · 2018-05-15

Michael Love 41k

@mikelove

Last seen 12 minutes ago

United States

The spikes in the histogram in A and B are from genes where you have e.g. 1 count across all samples.
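To see why very low counts produce spikes, here is a toy sketch in base R. It is not DESeq2's test: it swaps in a simple conditional binomial test (each read of a gene lands in either of two equally sized groups with probability 0.5), and the helper name `attainable_pvalues` is made up for illustration. The point is only that a gene with a total count of 1 or 2 can take very few distinct p-values, so many such genes pile up in the same histogram bins.

```r
# Toy illustration (binomial test, NOT the DESeq2 Wald test):
# which two-sided p-values are attainable when a gene's total count k
# is split between two equally sized groups?
attainable_pvalues <- function(k) {
  # x = number of reads that fall in group 1 (0, 1, ..., k)
  sapply(0:k, function(x) binom.test(x, k, p = 0.5)$p.value)
}

p_k1  <- attainable_pvalues(1)   # gene with a single read in total
p_k2  <- attainable_pvalues(2)   # gene with two reads in total
p_k50 <- attainable_pvalues(50)  # a reasonably expressed gene

print(p_k1)   # every outcome gives p = 1
print(p_k2)   # only 0.5 and 1 are possible
print(length(unique(round(p_k50, 12))))  # many more distinct values
```

With real data the exact positions of the spikes depend on the model and on the size factors, but the mechanism (few attainable p-values for tiny counts) is the same idea.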

If you are going to assess the p-value histogram, I'd recommend using a stronger filter, e.g.

keep <- rowSums(counts(dds, normalized=TRUE) >= 10) >= n

That is, keep genes with a normalized count of 10 or more in 'n' or more samples.
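A minimal sketch of how this filter behaves, using a simulated matrix in place of `counts(dds, normalized=TRUE)` so that it runs without any Bioconductor packages. The matrix, the gene means, and the choice `n <- 3` (for example, the size of the smaller group) are assumptions for illustration only.

```r
set.seed(1)

# Hypothetical stand-in for counts(dds, normalized = TRUE):
# 1000 genes x 6 samples with a wide range of expression levels
n_genes   <- 1000
n_samples <- 6
gene_means  <- rexp(n_genes, rate = 1 / 20)
norm_counts <- matrix(
  rpois(n_genes * n_samples, lambda = rep(gene_means, n_samples)),
  nrow = n_genes
)

n <- 3  # assumed: require at least 3 samples above the threshold

# Weak filter from the question (count.th = 1): total count of at least 1
weak_keep <- rowSums(norm_counts) >= 1

# Stronger filter: 'n' or more samples with a count of 10 or more
keep <- rowSums(norm_counts >= 10) >= n

# Cross-tabulate how many genes each filter retains
print(table(weak = weak_keep, strong = keep))

filtered <- norm_counts[keep, , drop = FALSE]
```

On a real object the same logical vector would be used as `dds <- dds[keep, ]`. Note that `counts(dds, normalized=TRUE)` needs size factors, so `estimateSizeFactors(dds)` (or `DESeq(dds)`) has to be run first.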

I don't in general recommend altering the p-values to create more significant hits. What does the PCA plot for this experiment look like? The data may be telling you that the differences between groups are not larger than the biological variability. That should be the default assumption when there are few hits.
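For reference, a rough sketch of such a PCA check in base R on simulated counts with no group effect. The count matrix and sample names are made up, and `log2(x + 1)` is used here only as a simple stand-in for a proper variance-stabilizing transformation. In DESeq2 itself the usual route is `plotPCA()` on the output of `vst()` or `rlog()`.

```r
set.seed(2)

n_genes <- 500
group   <- factor(rep(c("WT", "KO"), each = 3))

# Hypothetical counts: same distribution in both groups (no WT/KO effect)
counts_mat <- matrix(
  rnbinom(n_genes * length(group), mu = 100, size = 5),
  nrow = n_genes,
  dimnames = list(NULL, paste0(group, 1:3))
)

# Simple log transform as a stand-in for vst()/rlog()
log_counts <- log2(counts_mat + 1)

# PCA with samples as rows
pca     <- prcomp(t(log_counts))
pct_var <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

plot(pca$x[, 1], pca$x[, 2],
     col = as.integer(group), pch = 19,
     xlab = paste0("PC1 (", pct_var[1], "%)"),
     ylab = paste0("PC2 (", pct_var[2], "%)"))
legend("topright", legend = levels(group),
       col = seq_along(levels(group)), pch = 19)
```

If WT and KO samples are interleaved along the leading components, as they should be in this simulation, that is consistent with group differences being no larger than the sample-to-sample variability.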

written 6.0 years ago by Michael Love

Hi Michael,

Thanks for your answer. You are right: in the PCA plot, the samples are intermingled regardless of condition, so I understand why there are few significant genes. However, I would expect the p-value distribution to be almost flat in that case, whereas I see some depletion near 0, as in panels C and D of the figure. Could you explain how to interpret this?

written 6.0 years ago by bioinfo

It could be from unaccounted-for heterogeneity, and there are packages that help you estimate surrogate variables or factors of unwanted variation.
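A toy simulation can make this concrete. It is a sketch under strong assumptions, not the DESeq2 model: Gaussian data, a plain two-sample t-test per gene, no true WT/KO difference, and a hypothetical hidden factor (`batch`, with an assumed shift of 2 standard deviations) that is balanced across the two groups and left out of the test. In this setting the hidden factor inflates the variance estimate without affecting the group difference, so small p-values become rarer than under a uniform distribution.

```r
set.seed(3)

n_genes <- 2000
group   <- rep(c("WT", "KO"), each = 6)

# Hidden factor (e.g. batch): 3 + 3 samples within each group,
# balanced across WT/KO and NOT included in the test
batch        <- rep(rep(c(0, 1), each = 3), times = 2)
batch_effect <- 2  # assumed shift, in units of the noise SD

sim_pvalues <- function(effect) {
  replicate(n_genes, {
    # No WT/KO difference: only noise plus the hidden factor
    y <- rnorm(length(group)) + effect * batch
    t.test(y[group == "KO"], y[group == "WT"], var.equal = TRUE)$p.value
  })
}

p_clean  <- sim_pvalues(0)             # no hidden factor
p_hidden <- sim_pvalues(batch_effect)  # hidden factor ignored

par(mfrow = c(1, 2))
hist(p_clean,  breaks = 20, main = "No hidden factor",         xlab = "p-value")
hist(p_hidden, breaks = 20, main = "Unmodelled hidden factor", xlab = "p-value")

# Fraction of genes with p < 0.05 in each scenario
print(mean(p_clean  < 0.05))
print(mean(p_hidden < 0.05))
```

The first histogram should look roughly flat, while the second is depleted near zero and shifted toward 1. This is only one scenario: a hidden factor that is correlated with the groups could instead produce an excess of small p-values.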

In general, I don't like post-processing the p-values to induce some differential results unless there's a clear case for it, and that p-value histogram isn't a good case for it.

written 6.0 years ago by Michael Love

Thanks for your help. I agree with you: I don't want to post-process the p-values just to obtain significant results, and I can conclude that there are no significant genes. But it is still unclear to me why the p-value distribution is not flat when no genes are significant. If I understand you correctly, unaccounted-for heterogeneity can deplete the p-value distribution near zero. Could you confirm my understanding?