Accounting for background read level in DESeq2
@david-auble-7199

Hi, I'm using DESeq2 to identify differentially affected histone modifications in cells from control and treated human populations. I have 10 control and 10 treated ChIP-seq datasets. I've done the analysis in two ways, the only difference being how the data were initially mapped to the genome.

In the first instance there was no threshold for reads mapping to multiple sites; anything that could be mapped was kept (and assigned a mapping location in the way Bowtie does).

In the second, reads that mapped to more than 3 genomic locations were discarded. The difference in total numbers of mapped reads was quite small in the end, perhaps 1-2 percent, but the background is a little higher when the reads were mapped without the threshold.

DESeq2 yielded a handful of significant hits (BH-corrected p-values < 0.05) using the count table made from the data mapped in the first way (no mapping threshold). Running DESeq2 on the data mapped with the more stringent threshold yielded nothing significant: the raw p-values are 10-100 fold worse overall, and most of the genes received a padj of NA from independent filtering. In the first analysis, independent filtering removed only one gene, whereas in the second it removed most genes.

I presume that when I map the reads with higher stringency, the background is lower, and this is throwing off the analysis, possibly by pushing many of my genes into what the filter treats as background. When I turn off independent filtering on the count data derived from the stringent mapping, my top hits from the first analysis do pop up toward the top of the list, but the BH-corrected p-values are still very poor (~0.3). The total number of genes in the two analyses is almost the same, and when I look at ~20 genes at random, the raw and normalized read counts are virtually identical in the two analyses. So a quite modest change in background level appears to have a very large effect on the outcome.

I didn't realize that mapping parameters could so strongly influence the downstream analysis.  Ironically, I already know the hits I originally got are real because they have now been validated.  

Could anyone offer advice on how to boost the power of the DESeq2 analysis of my high-stringency mapped data? In other words, how can I account for the loss of statistical power that comes from simply reducing the background read level? Should I set a baseMean threshold (I'm not sure exactly how to do that; a rough sketch of what I have in mind is below)? Or is there a better approach?
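To be concrete, here is roughly what I mean. This is only a sketch, not my actual script: dds stands for the DESeqDataSet built from the high-stringency counts after running DESeq(), the cutoff of 100 is an arbitrary placeholder, and I'm not sure the manual re-adjustment in the last two lines is a statistically legitimate thing to do.

# default results, with independent filtering on
res <- results(dds)

# the same call with independent filtering switched off, as described above
res.nofilt <- results(dds, independentFiltering=FALSE)

# one guess at a manual baseMean threshold: keep genes above an arbitrary
# cutoff and redo the BH adjustment on that subset only
keep <- !is.na(res.nofilt$pvalue) & res.nofilt$baseMean > 100
padj.manual <- p.adjust(res.nofilt$pvalue[keep], method="BH")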

I apologize if this is unintelligible or naive.  Happy holidays one and all-

David Auble     

> sessionInfo()

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] DESeq2_1.2.10             RcppArmadillo_0.4.100.2.1 Rcpp_0.11.0              
[4] GenomicRanges_1.14.4      XVector_0.2.0             IRanges_1.20.7           
[7] Biobase_2.22.0            BiocGenerics_0.8.0       

loaded via a namespace (and not attached):
 [1] annotate_1.40.1      AnnotationDbi_1.24.0 DBI_0.2-7            genefilter_1.44.0   
 [5] grid_3.0.2           lattice_0.20-23      locfit_1.5-9.1       RColorBrewer_1.0-5  
 [9] RSQLite_0.11.4       splines_3.0.2        stats4_3.0.2         survival_2.37-4     
[13] tools_3.0.2          XML_3.95-0.2         xtable_1.7-3     

@mikelove

hi David,

A few questions about the dataset:

"The difference in total numbers of mapped reads was quite small in the end, perhaps 1-2 percent, but the background is a little higher when the reads were mapped without the threshold."

How do the average raw counts look for the different samples before and after applying the mapping threshold? That is, colMeans(counts(dds)). And what about the quantiles, apply(counts(dds), 2, quantile, 0:10/10), before and after the mapping threshold?
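In code, something like this (just a sketch; dds.low and dds.high are placeholder names for the DESeqDataSet objects built from the two count tables):

# per-sample mean of the raw counts, side by side for the two mappings
cbind(low=colMeans(counts(dds.low)), high=colMeans(counts(dds.high)))

# deciles of the raw counts for each sample, one matrix per mapping
apply(counts(dds.low), 2, quantile, 0:10/10)
apply(counts(dds.high), 2, quantile, 0:10/10)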

"running DESeq2 using data mapped with the more stringent threshold yielded nothing of significance, the raw pvalues are 10-100 fold worse overall"

This would be surprising if the average counts are actually nearly the same. So if you plot,

plot(-log10(res1$pvalue), -log10(res2$pvalue)); abline(0,1)

is the second set of p-values shifted off the diagonal by about 2?

@david-auble-7199

Hi Mike, 

Thank you so much for responding to my query. For colMeans(counts(dds)) I see the following ("low" and "high" stringency refer to the Bowtie mapping settings I outlined above):

sample     low stringency    high stringency
ctrl1      80.4408107        104.4880321
ctrl2      42.354003         40.40577672
ctrl3      216.53641         212.8593997
ctrl4      302.5399808       300.5543639
ctrl5      108.1967149       105.5024739
ctrl6      203.6510454       199.8211227
ctrl7      28.56431157       27.22699987
ctrl8      224.8380329       269.4394073
ctrl9      171.4260349       169.1143539
ctrl10     169.4779964       167.5347733
treat1     59.73514419       57.26040437
treat2     130.166523        127.2833605
treat3     142.2445145       139.5063418
treat4     292.9078977       290.6150948
treat5     167.9622725       164.5435891
treat6     85.02240993       82.84654025
treat7     23.17270421       31.2105739
treat8     70.81471175       92.71852317
treat9     115.011008        112.8757001
treat10    186.0340089       185.4591988
treat11    177.0594233       175.1608188

I don't see a lot that's different here, but maybe you see something notable.  

For apply(counts(dds),2,quantile,0:10/10) I see the following ("low" and "high" refer to mapping stringency and samples are numbered 1-21 in the same order as above).

  1 low 1 high 2 low 2 high 3 low 3 high 4 low 4 high 5 low 5 high 6 low  6 high 7 low 7 high 8 low 8 high 9 low 9 high 10 low 10 high 11 low 11 high 12 low 12 high 13 low 13 high 14 low 14 high 15 low 15 high 16 low 16 high 17 low 17 high 18 low 18 high 19 low 19 high 20 low  20 high 21 low 21 high
0% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10% 0 4 4 4 5 5 6 5 4 3 2 2 1 1 0 7 6 6 6 6 4 4 6 6 6 6 6 6 4 4 3 3 0 3 0 7 7 7 4 3 7 7
20% 0 7 6 6 12 11 11 10 7 7 6 5 3 3 0 12 11 10 12 11 6 6 10 9 10 9 13 12 8 8 6 5 0 5 0 11 11 10 10 9 13 12
30% 3 11 8 8 21 19 19 18 12 11 12 11 6 6 9 19 16 15 19 18 9 8 15 14 15 15 23 22 14 13 9 9 1 7 4 16 15 14 18 17 21 20
40% 7 17 11 11 32 30 30 29 18 17 19 18 9 9 17 31 24 23 28 27 12 12 22 21 24 23 36 34 22 21 15 14 5 10 11 23 21 21 27 26 32 31
50% 13 26 15 15 51 49 52 49 28 26 34 31 14 13 30 51 38 36 44 42 17 17 34 32 37 36 60 57 37 35 22 21 8 13 18 34 32 31 44 43 49 48
60% 23 46 22 22 92 90 107 104 49 47 70 66 19 19 55 99 68 66 75 74 27 26 57 55 64 62.4 111 109 70 68 38 37 12 19 31 54 53 52 80 78 83 82
70% 46 96 38 38 202.2 203 268 271 104 103 179 179 26 26 131 243 148 149 154 155 52 51 115 114 132 132 261 264 162 162 79 79 19 30 55 94 105 104.8 174 176 163 164
80% 128 193 68 67 410 407 596.8 601 205 202 408.8 407 41 40 392 517 320 320 310 310 102.8 101 232 231 265 263 565 568 331 327 160 158 36 50 119 163 207 206 347 350 322 321
90% 281 317 113 110 666 657 979 974 327 320 672 659 71 69 785.4 853 530 526 513 508.6 171 166 392 384 429 421 923 916 523 514 257 251.6 69 79 225.4 254 335 332 569.4 570 525 521
100% 4259 13299 13595 10116 13677 9866 13104 9357 13993 9954 6958 4980 13946 10001 7515 7942 25623 18844 19393 14088 20353 14506 19147 14099 18274 13508 16789 10884 17842 12759 11160 8019 3696 15108 10681 23096 32051 24564 19746 15893 22626 17111

I am not sure how to interpret this.  In general the numbers appear lower for each sample in the high versus low stringency data, but a notable exception is sample 1 where the opposite is true, particularly at 100%.  

I tried to generate the plot but got the following error (likely not a big deal but I don't know how to remedy it).

> plot(-log10(res$pvalue), -log10(res1$pvalue)); abline(0,1)
Error in xy.coords(x, y, xlabel, ylabel, log) : 
  'x' and 'y' lengths differ


hi David,

Something might be wrong in the scripts upstream, because I don't see how a higher stringency could result in higher gene counts for samples 1, 8, 17, and 18.

The counts for these are shifted enough that I would expect an impact on the significance testing.

The two results tables should have the same length unless you have pre-filtered the count table. If so, you can make the plot by creating an index of common row names:

common = intersect(rownames(res1), rownames(res2))

Then plot using res1[common,]$pvalue and res2[common,]$pvalue rather than the full p-value vectors.
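Something like this (just a sketch, assuming res1 and res2 are the results tables from the low- and high-stringency analyses):

common <- intersect(rownames(res1), rownames(res2))
plot(-log10(res1[common,]$pvalue), -log10(res2[common,]$pvalue),
     xlab="-log10 p-value, low stringency",
     ylab="-log10 p-value, high stringency")
abline(0,1)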


Hi Mike,

I think I see now... yes, something appears to be wrong with those four samples, and I appreciate your straightforward and incisive help in figuring this out. I don't think it's our scripts per se, since most of the files look OK by these and other criteria. I'll go back over the analyses of those four samples to see if I can figure out what happened. I have monkeyed around with this problem for quite some time, and as mentioned above, the simple-minded things I've done didn't suggest anything weird. So this is very, very helpful.

I don't think it matters so much at this point, but I wasn't able to generate the plot you suggested:

> common=intersect(rownames(res),rownames(res1))
> plot(-log10(res$pvalue),-log10(res1[common,]$pvalue));abline(0,1)
Error in xy.coords(x, y, xlabel, ylabel, log) : 
  'x' and 'y' lengths differ

Thanks again and Happy New Year-

David
