Question

accounting for background read level in DESeq2

0

Entering edit mode

David Auble • 0

@david-auble-7199

Last seen 9.3 years ago

United States

Hi, I'm using DESeq2 to identify differentially affected histone modifications in cells from control and treated human populations. I have 10 control and 10 treated ChIPseq datasets. I've done the analysis in two ways, the only difference being how the data were initially mapped to the genome.

In the first instance there was no threshold for sequences mapping to multiple sites; anything that could be mapped was kept (and assigned a mapping location in the way that Bowtie does this).

And in the second, reads were tossed that mapped to >3 genomic locations. The difference in total numbers of mapped reads was quite small in the end, perhaps 1-2 percent, but the background is a little higher when the reads were mapped without the threshold.

DESeq2 yielded a handful of significant hits (BH-corrected pvalues <0.05) using the count table made from data mapped in the first way (no mapping threshold), but running DESeq2 using data mapped with the more stringent threshold yielded nothing of significance, the raw pvalues are 10-100 fold worse overall and most of the genes were scored (padj) as NA using independent filtering. In the first analysis, independent filtering removed only one gene, whereas in the second, independent filtering removed most genes.

I presume that what is happening is that when I map the reads with higher stringency, the background is lower and this is throwing off the analysis by possibly putting many of my genes in the presumed mud. When I turn off independent filtering using the count data derived from the stringent mapping, now my top hits in the first analysis pop up toward the top of the list, but still the BH correct pvalues are very poor (~0.3). The total number of genes in the two analyses is almost the same, and when I look at ~20 genes at random, the raw and normalized read counts are virtually identical in the two analyses. So this appears to be a very big effect on the outcome derived from a change in background level which is itself quite modest.

I didn't realize that mapping parameters could so strongly influence the downstream analysis. Ironically, I already know the hits I originally got are real because they have now been validated.

Could anyone offer advice about how to boost the power of the DESeq2 analysis of my high stringency mapped data? In other words, how can I account for the loss in statistical power resulting from simply reducing the background read level? Should I set a baseMean threshold (not sure exactly how to do that). Or?

I apologize if this is unintelligible or naive. Happy holidays one and all-

David Auble

> sessionInfo()

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] DESeq2_1.2.10 RcppArmadillo_0.4.100.2.1 Rcpp_0.11.0
[4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.7
[7] Biobase_2.22.0 BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] annotate_1.40.1 AnnotationDbi_1.24.0 DBI_0.2-7 genefilter_1.44.0
[5] grid_3.0.2 lattice_0.20-23 locfit_1.5-9.1 RColorBrewer_1.0-5
[9] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 survival_2.37-4
[13] tools_3.0.2 XML_3.95-0.2 xtable_1.7-3

deseq2 • 1.3k views

ADD COMMENT • link 9.3 years ago David Auble • 0

score 0 · Answer 1 · 2014-12-25

hi David,

A few question about the dataset:

"The difference in total numbers of mapped reads was quite small in the end, perhaps 1-2 percent, but the background is a little higher when the reads were mapped without the threshold."

How does the average raw count look for the different samples before and after the mapping threshold? So, colMeans(counts(dds)). What about the quantiles? apply(counts(dds), 2, quantile, 0:10/10), before and after the mapping threshold

"running DESeq2 using data mapped with the more stringent threshold yielded nothing of significance, the raw pvalues are 10-100 fold worse overall"

This would be surprising if the average counts are actually nearly the same. So if you plot,

plot(-log10(res1$pvalue), -log10(res2$pvalue)); abline(0,1)

the second set of p-values are shifted off the diagonal by 2?

score 0 · Answer 2 · 2014-12-28

Hi Mike,

Thank you so much for responding to my query. For colMeans(counts(dds)) I see the following. (Low- and high-stringency refer to the Bowtie mapping settings I outlined above.

sample	low stringency	high stringency
ctrl1	80.4408107	104.4880321
ctrl2	42.354003	40.40577672
ctrl3	216.53641	212.8593997
ctrl4	302.5399808	300.5543639
ctrl5	108.1967149	105.5024739
ctrl6	203.6510454	199.8211227
ctrl7	28.56431157	27.22699987
ctrl8	224.8380329	269.4394073
ctrl9	171.4260349	169.1143539
ctrl10	169.4779964	167.5347733
treat1	59.73514419	57.26040437
treat2	130.166523	127.2833605
treat3	142.2445145	139.5063418
treat4	292.9078977	290.6150948
treat5	167.9622725	164.5435891
treat6	85.02240993	82.84654025
treat7	23.17270421	31.2105739
treat8	70.81471175	92.71852317
treat9	115.011008	112.8757001
treat10	186.0340089	185.4591988
treat11	177.0594233	175.1608188

I don't see a lot that's different here, but maybe you see something notable.

For apply(counts(dds),2,quantile,0:10/10) I see the following ("low" and "high" refer to mapping stringency and samples are numbered 1-21 in the same order as above).

	1 low	1 high	2 low	2 high	3 low	3 high	4 low	4 high	5 low	5 high	6 low	6 high	7 low	7 high	8 low	8 high	9 low	9 high	10 low	10 high	11 low	11 high	12 low	12 high	13 low	13 high	14 low	14 high	15 low	15 high	16 low	16 high	17 low	17 high	18 low	18 high	19 low	19 high	20 low	20 high	21 low	21 high
0%	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
10%	0	4	4	4	5	5	6	5	4	3	2	2	1	1	0	7	6	6	6	6	4	4	6	6	6	6	6	6	4	4	3	3	0	3	0	7	7	7	4	3	7	7
20%	0	7	6	6	12	11	11	10	7	7	6	5	3	3	0	12	11	10	12	11	6	6	10	9	10	9	13	12	8	8	6	5	0	5	0	11	11	10	10	9	13	12
30%	3	11	8	8	21	19	19	18	12	11	12	11	6	6	9	19	16	15	19	18	9	8	15	14	15	15	23	22	14	13	9	9	1	7	4	16	15	14	18	17	21	20
40%	7	17	11	11	32	30	30	29	18	17	19	18	9	9	17	31	24	23	28	27	12	12	22	21	24	23	36	34	22	21	15	14	5	10	11	23	21	21	27	26	32	31
50%	13	26	15	15	51	49	52	49	28	26	34	31	14	13	30	51	38	36	44	42	17	17	34	32	37	36	60	57	37	35	22	21	8	13	18	34	32	31	44	43	49	48
60%	23	46	22	22	92	90	107	104	49	47	70	66	19	19	55	99	68	66	75	74	27	26	57	55	64	62.4	111	109	70	68	38	37	12	19	31	54	53	52	80	78	83	82
70%	46	96	38	38	202.2	203	268	271	104	103	179	179	26	26	131	243	148	149	154	155	52	51	115	114	132	132	261	264	162	162	79	79	19	30	55	94	105	104.8	174	176	163	164
80%	128	193	68	67	410	407	596.8	601	205	202	408.8	407	41	40	392	517	320	320	310	310	102.8	101	232	231	265	263	565	568	331	327	160	158	36	50	119	163	207	206	347	350	322	321
90%	281	317	113	110	666	657	979	974	327	320	672	659	71	69	785.4	853	530	526	513	508.6	171	166	392	384	429	421	923	916	523	514	257	251.6	69	79	225.4	254	335	332	569.4	570	525	521
100%	4259	13299	13595	10116	13677	9866	13104	9357	13993	9954	6958	4980	13946	10001	7515	7942	25623	18844	19393	14088	20353	14506	19147	14099	18274	13508	16789	10884	17842	12759	11160	8019	3696	15108	10681	23096	32051	24564	19746	15893	22626	17111

I am not sure how to interpret this. In general the numbers appear lower for each sample in the high versus low stringency data, but a notable exception is sample 1 where the opposite is true, particularly at 100%.

I tried to generate the plot but got the following error (likely not a big deal but I don't know how to remedy it).

> plot(-log10(res$pvalue), -log10(res1$pvalue)); abline(0,1)
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ