False positives due to GC content correction - DESeq2
Guest User ★ 12k
@guest-user-4897
Hi Mike,

I have been trying to use DESeq2 for a differential analysis of ChIP-seq data using 8 T/N pairs. The samples are very heterogeneous because of clinical differences (tumor stage, etc.), total mapped reads (some samples are much better than others), and batch effects (they were processed at different times and not by the same person). I wanted to correct at least some of these biases, starting with GC content, so I used offsets from EDASeq as input to DESeq2 and added the batch variable to the model.

What I don't understand is that, when I correct for GC bias in the samples, the final results tend to contain a lot of false positives. I have attached the dispersion plots for both runs. I can't seem to figure out why.

-- Sent via the guest posting facility at bioconductor.org.
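The step described above, turning EDASeq offsets into per-region, per-sample normalization factors for DESeq2, can be sketched in base R. This is only an illustration under stated assumptions: the `offsets` matrix below is made up and stands in for what EDASeq's `offst()` would return, and the sign flip and geometric-mean centering follow the convention shown in the package vignettes as I understand them, so they should be checked against the installed versions.

```r
# Hypothetical log-scale offsets (regions x samples); in a real analysis
# these would come from EDASeq rather than being typed in by hand.
offsets <- matrix(
  c( 0.10, -0.20,  0.05,
    -0.30,  0.15,  0.00,
     0.25,  0.10, -0.40),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("region", 1:3), paste0("sample", 1:3))
)

# Convert the offsets into multiplicative factors (assumed sign convention).
norm_factors <- exp(-1 * offsets)

# Divide each row by its geometric mean so that the factors for every region
# centre on 1 and normalized counts stay on the scale of the raw counts.
norm_factors <- norm_factors / exp(rowMeans(log(norm_factors)))

# Sanity check: the geometric mean of each row should now be 1.
row_geo_means <- exp(rowMeans(log(norm_factors)))
print(round(row_geo_means, 6))
```

With real objects, a matrix like `norm_factors` would be handed to DESeq2 with `normalizationFactors(dds) <- norm_factors` before estimating dispersions.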
EDASeq DESeq2 • 1.4k views
@mikelove
United States
Hi Aditi,

Please include all the code you used for EDASeq and DESeq2, and the output of sessionInfo().

How do you know there are false positives? Are these genes which you know are not differentially expressed?

Your dispersion plots didn't come through. You can email those attachments to my email address, and we will continue the discussion on the Bioc list.

Mike
Hi Aditi,

Your code looks correct to me. The normalization factors are also correctly taking sequencing depth into account, which is what I wanted to check by looking at scatterplots of normalized counts for pairs of samples.

I took a look at the results, and I also see, as you say, the additional genes after GC correction:

    > res <- results(dds)
    > res2 <- results(dds2_nongc)
    > table(gc.correct=res$padj < .1, no.correct=res2$padj < .1)
              no.correct
    gc.correct FALSE  TRUE
         FALSE 20810   143
         TRUE    368   472

Ideally, additional genes can show up as significant if we have reduced technical noise by modeling the normalization factors with technical covariates such as GC content. But you suspect these new genes. Can you explain how you know that these are false positives? And is it just the genes added after GC correction that are enriched for false positives?

Mike

On Fri, Aug 8, 2014 at 2:29 PM, QAMRA Aditi (GIS) <qamraa99 at gis.a-star.edu.sg> wrote:
> Hi Mike,
>
> Sorry, it seems my message got cut off midway. What I was saying was that I don't understand how I can estimate what the source of these false positives could be. Yes, these are regions that I know are not differentially expressed.
>
> I've attached the code for the analysis as well as the dispersion plots.
>
> Session Info -
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>  [1] EDASeq_1.10.0           aroma.light_2.0.0       matrixStats_0.10.0
>  [4] ShortRead_1.22.0        GenomicAlignments_1.0.3 BSgenome_1.32.0
>  [7] Rsamtools_1.16.1        Biostrings_2.32.1       XVector_0.4.0
> [10] BiocParallel_0.6.1      Biobase_2.24.0          DESeq2_1.4.5
> [13] RcppArmadillo_0.4.320.0 Rcpp_0.11.2             GenomicRanges_1.16.3
> [16] GenomeInfoDb_1.0.2      IRanges_1.22.10         BiocGenerics_0.10.0
> [19] BiocInstaller_1.14.2
>
> loaded via a namespace (and not attached):
>  [1] annotate_1.42.1      AnnotationDbi_1.26.0 BatchJobs_1.3
>  [4] BBmisc_1.7           bitops_1.0-6         brew_1.0-6
>  [7] checkmate_1.2        codetools_0.2-8      DBI_0.2-7
> [10] DESeq_1.16.0         digest_0.6.4         fail_1.2
> [13] foreach_1.4.2        genefilter_1.46.1    geneplotter_1.42.0
> [16] grid_3.1.0           hwriter_1.3          iterators_1.0.7
> [19] lattice_0.20-29      latticeExtra_0.6-26  locfit_1.5-9.1
> [22] RColorBrewer_1.0-5   R.methodsS3_1.6.1    R.oo_1.18.0
> [25] RSQLite_0.11.4       sendmailR_1.1-2      splines_3.1.0
> [28] stats4_3.1.0         stringr_0.6.2        survival_2.37-7
> [31] tools_3.1.0          XML_3.98-1.1         xtable_1.7-3
> [34] zlibbioc_1.10.0
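The cross-tabulation in Mike's reply can be reproduced on toy data to see how such a comparison behaves when some adjusted p-values are missing. This is a minimal sketch, not the analysis from the thread: the two `padj_*` vectors are invented, and treating `NA` as "not significant" is an assumption made here so that every region is counted (in DESeq2 results, `padj` can be `NA` for some rows, for example those removed by independent filtering).

```r
# Hypothetical adjusted p-values for six regions from the two runs.
padj_gc    <- c(0.01, 0.20, NA,   0.03, 0.50, 0.08)
padj_nongc <- c(0.02, 0.04, 0.30, NA,   0.60, 0.40)

# Treat NA as "not significant" so that no region silently drops out.
sig_gc    <- !is.na(padj_gc)    & padj_gc    < 0.1
sig_nongc <- !is.na(padj_nongc) & padj_nongc < 0.1

# Cross-tabulate the significance calls of the two runs.
tab <- table(gc.correct = sig_gc, no.correct = sig_nongc)
print(tab)

# Indices of regions that are significant only after GC correction;
# these are the ones to inspect as candidate false positives.
gc_only <- which(sig_gc & !sig_nongc)
print(gc_only)
```

Pulling out the `gc_only` indices makes it possible to look at just the newly added regions, which is the subset the question is about.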
Hi Michael,

Yes, the regions added after GC correction are mostly regions with very low read counts. Some correspond to genes/regions that I already know are not different; others mark regions that show no difference in read count when I look at the bedGraph tracks.

Aditi
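One way to check the observation that the newly significant regions are mostly low-count is to compare the mean count of regions grouped by which run called them significant. The sketch below uses a made-up data frame: the `baseMean` column name mirrors the column of that name in a DESeq2 results table, while the values and the `sig_gc` / `sig_nongc` calls are hypothetical.

```r
# Hypothetical per-region summary: mean normalized count plus the
# significance calls from the GC-corrected and uncorrected runs.
regions <- data.frame(
  baseMean  = c(5, 8, 250, 400, 12, 900, 3, 150),
  sig_gc    = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  sig_nongc = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Label each region by which run(s) called it significant.
regions$group <- with(regions, ifelse(
  sig_gc & sig_nongc, "both",
  ifelse(sig_gc, "gc_only", ifelse(sig_nongc, "nongc_only", "neither"))
))

# Median mean count per group; a much lower median in "gc_only" would be
# consistent with the new calls coming from low-count regions.
median_by_group <- tapply(regions$baseMean, regions$group, median)
print(median_by_group)
```

If the "gc_only" group sits far below the others, a minimum-count filter applied before testing is one thing to try, though whether that is appropriate depends on the data.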