Hi, I'm using DESeq2 to identify differentially affected histone modifications in cells from control and treated human populations. I have 10 control and 10 treated ChIP-seq datasets. I've done the analysis in two ways, the only difference being how the data were initially mapped to the genome.
In the first, there was no threshold on reads mapping to multiple sites; anything that could be mapped was kept (and assigned a mapping location the way Bowtie does).
And in the second, reads that mapped to more than 3 genomic locations were discarded. The difference in total numbers of mapped reads was quite small in the end, perhaps 1-2 percent, but the background is a little higher when the reads were mapped without the threshold.
DESeq2 yielded a handful of significant hits (BH-corrected p-values < 0.05) using the count table made from the data mapped the first way (no mapping threshold). Running DESeq2 on the data mapped with the more stringent threshold yielded nothing of significance: the raw p-values are 10-100 fold worse overall, and most of the genes were scored (padj) as NA by independent filtering. In the first analysis, independent filtering removed only one gene, whereas in the second it removed most genes.
I presume that what is happening is that when I map the reads with higher stringency, the background is lower, and this is throwing off the analysis by putting many of my genes in the presumed mud. When I turn off independent filtering on the count data derived from the stringent mapping, my top hits from the first analysis now pop up toward the top of the list, but the BH-corrected p-values are still very poor (~0.3). The total number of genes in the two analyses is almost the same, and when I look at ~20 genes at random, the raw and normalized read counts are virtually identical in the two analyses. So this appears to be a very big effect on the outcome driven by a change in background level which is itself quite modest.
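For what it's worth, here is roughly how I have been checking this (a sketch only, with illustrative object names; dds is the DESeqDataSet from the stringent mapping and res its results table):

# how many genes received padj = NA under independent filtering
table(is.na(res$padj))

# re-extract results with independent filtering switched off
res_noIF <- results(dds, independentFiltering = FALSE)
head(res_noIF[order(res_noIF$pvalue), ])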
I didn't realize that mapping parameters could so strongly influence the downstream analysis. Ironically, I already know the hits I originally got are real because they have now been validated.
Could anyone offer advice about how to boost the power of the DESeq2 analysis of my high-stringency-mapped data? In other words, how can I account for the loss in statistical power resulting from simply reducing the background read level? Should I set a baseMean threshold (I'm not sure exactly how to do that)? Or something else?
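To be concrete about the baseMean idea, something along these lines is what I was imagining (again just a sketch; the cutoff of 10 is arbitrary, and res_noIF is the unfiltered results table from above):

# keep genes above an arbitrary mean-count cutoff, then redo the BH correction by hand
keep <- !is.na(res_noIF$pvalue) & res_noIF$baseMean > 10
padj_manual <- p.adjust(res_noIF$pvalue[keep], method = "BH")
sum(padj_manual < 0.05)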
I apologize if this is unintelligible or naive. Happy holidays one and all-
David Auble
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.2.10 RcppArmadillo_0.4.100.2.1 Rcpp_0.11.0
[4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.7
[7] Biobase_2.22.0 BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] annotate_1.40.1 AnnotationDbi_1.24.0 DBI_0.2-7 genefilter_1.44.0
[5] grid_3.0.2 lattice_0.20-23 locfit_1.5-9.1 RColorBrewer_1.0-5
[9] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 survival_2.37-4
[13] tools_3.0.2 XML_3.95-0.2 xtable_1.7-3
hi David,
Something might be wrong in the scripts upstream, because I don't see how a higher mapping stringency could result in higher gene counts for samples 1, 8, 17, and 18.
The counts for these are shifted enough that I would expect an impact on the significance testing.
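Something like the following would show it, assuming dds1 and dds2 are the DESeqDataSet objects built from the two count tables:

# total counts per sample under the two mapping strategies
colSums(counts(dds1))
colSums(counts(dds2))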
The two results tables should have the same length unless you have pre-filtered the count table. If so, you can make the plot by creating an index of common row names:
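# rows present in both results tables (res and res1 here stand for the two results objects)
common <- intersect(rownames(res), rownames(res1))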
Then plot using res1[common,]$pvalue instead of res1$pvalue.
Hi Mike,
I think I see now... yes, something appears wrong with those four samples, and I appreciate your straightforward and incisive help in figuring this out. I don't think it's our scripts per se, since most of the files look OK by these and other criteria. I'll go back over the analyses of these four to see if I can figure out what happened. I have monkeyed around with this problem for quite some time; as mentioned above, the simple-minded things I've tried didn't suggest anything weird. So this is very, very helpful.
I don't think it matters so much at this point, but I wasn't able to generate the plot you suggested:
> common=intersect(rownames(res),rownames(res1))
> plot(-log10(res$pvalue),-log10(res1[common,]$pvalue));abline(0,1)
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
Thanks again and Happy New Year-
David