Using DESeq with ChIP-seq data - all or non-redundant reads?


Ian Donaldson ▴ 70

@ian-donaldson-4761

Last seen 9.6 years ago

I have been using DESeq to look at differential binding in ChIP-seq data for a while now. Recently we have been discussing locally whether the ChIP-seq reads given to DESeq should be the full set or the non-redundant set. The worry is that the full set of reads may contain spuriously amplified reads, but using a non-redundant set removes information, particularly about strongly enriched binding regions.

I would be very interested to get your views on this.

Thanks!

Ian

________________________________________
From: bioconductor-bounces@r-project.org on behalf of Simon Anders [anders@embl.de]
Sent: 20 July 2011 14:20
To: bioconductor at r-project.org
Subject: Re: [BioC] Using DESeq with ChIP-seq data

Hi Ian

On 07/20/2011 02:18 PM, Simon Anders wrote:
> What I meant is: Pool all four samples, give them to the peak finder in
> one big chunk and so get a list of binding regions. Then, count for each
> sample how many reads fall into each of the binding regions, obtaining a
> table with four columns for your four samples and one row for each
> binding region found in the pool. Give this table to DESeq. We've tried
> this approach once with some Pol-II ChIP-Seq data and it worked quite well.

Forgot to mention: When we did this, we counted the reads from the ChIPed sample. We used the input control samples only for the peak finding, not in the counting. IIRC, we only had one common control lane for both conditions, so that it would cancel out when comparing the conditions.

If you have separate controls, you may want to count for them as well and use DESeq's GLM function to test for an interaction contrast.

S
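For readers who want to see the counting step Simon describes in concrete form, here is a minimal Python sketch. It is not code from the thread or from DESeq. It assumes the pooled peak calls are already available as non-overlapping, half-open `(chrom, start, end)` intervals and that each read is reduced to a single `(chrom, position)` pair. The function name `count_table` and the sample names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def count_table(regions, reads_by_sample):
    """Count reads per region and sample.

    regions: list of (chrom, start, end), half-open, assumed non-overlapping.
    reads_by_sample: dict mapping sample name -> list of (chrom, position).
    Returns (sample_names, table) with one row per region, one column per sample.
    """
    # Group regions by chromosome and sort by start, remembering the row index.
    index = defaultdict(list)
    for row, (chrom, start, end) in enumerate(regions):
        index[chrom].append((start, end, row))
    for chrom in index:
        index[chrom].sort()
    starts = {chrom: [s for s, _, _ in ivs] for chrom, ivs in index.items()}

    samples = list(reads_by_sample)
    table = [[0] * len(samples) for _ in regions]
    for col, sample in enumerate(samples):
        for chrom, pos in reads_by_sample[sample]:
            if chrom not in index:
                continue
            # Rightmost region whose start is <= pos, if any.
            k = bisect_right(starts[chrom], pos) - 1
            if k >= 0:
                start, end, row = index[chrom][k]
                if start <= pos < end:
                    table[row][col] += 1
    return samples, table


# Hypothetical toy input: two pooled binding regions, two samples.
regions = [("chr1", 100, 200), ("chr1", 500, 600)]
reads_by_sample = {
    "ctl_1": [("chr1", 150), ("chr1", 199), ("chr1", 200)],
    "trt_1": [("chr1", 550), ("chr2", 150)],
}
samples, table = count_table(regions, reads_by_sample)
print(samples)
for region, row in zip(regions, table):
    print(region, row)
```

The resulting table (rows are binding regions, columns are samples) has the shape Simon describes as the input to DESeq. Whether the reads fed into it are the full or the non-redundant set is exactly the open question of this thread.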

DESeq DESeq • 1.3k views

ADD COMMENT • link updated 12.5 years ago by Ivan Gregoretti ▴ 310 • written 12.5 years ago by Ian Donaldson ▴ 70


Ivan Gregoretti ▴ 310

@ivan-gregoretti-3975

Last seen 9.6 years ago

Canada

Hello Ian,

I think that, in general, removing duplicates is good practice in ChIP-seq.

Of course, when you have very high coverage, genuine but identically positioned tags will be mistaken for PCR duplicates.

How does that affect you? You run the risk of underestimating the signal strength of stronger peaks rather than weaker ones.

Duplicate removal affects stronger peaks more. The weaker the peak, the less likely it is to contain genuine duplicates. So removing duplicates, even genuine ones, will not make your weakest signals disappear, which is critical.

If you care only about peak location rather than peak intensity, then duplicate removal should be used without reservation.

As always, opinions that disagree are welcome.

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
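To make the trade-off concrete, here is a small Python sketch using made-up numbers. It is an illustration, not an analysis of real data. It assumes single-end reads identified by `(chrom, start position, strand)` and treats any repeat of that triple as a duplicate, which is how position-based duplicate removal typically works. The two peaks and the helper `remove_duplicates` are hypothetical.

```python
def remove_duplicates(reads):
    """Keep the first read seen at each (chrom, position, strand)."""
    seen = set()
    kept = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            kept.append(read)
    return kept


# Hypothetical strong peak: 500 reads piled onto 50 distinct start positions.
strong = [("chr1", 1000 + i % 50, "+") for i in range(500)]
# Hypothetical weak peak: 20 reads, each at a distinct start position.
weak = [("chr1", 5000 + (7 * i) % 50, "+") for i in range(20)]

strong_kept = len(remove_duplicates(strong))
weak_kept = len(remove_duplicates(weak))

print("strong:", len(strong), "->", strong_kept)
print("weak:  ", len(weak), "->", weak_kept)
print("strong/weak ratio before:", len(strong) / len(weak))
print("strong/weak ratio after: ", strong_kept / weak_kept)
```

In this toy case the weak peak keeps all 20 of its reads while the strong peak drops from 500 to 50, so the weak signal survives but the apparent ratio between the two peaks shrinks from 25 to 2.5. Real data will not be this clean, since some of the repeated reads may be PCR artifacts rather than genuine signal.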

ADD COMMENT • link 12.5 years ago Ivan Gregoretti ▴ 310


Ian-

Given that you are using DESeq to look at differential binding, the relative magnitudes of the peaks are crucial to the calculations. Removing duplicates clips the dynamic range such that no peak can have a greater magnitude than the read length. In our case, where we use 36-base single-end reads for most of our ChIPs, this clipping is so severe as to render a quantitative differential binding analysis (such as you are attempting) of limited use.

We check the duplication rates of our ChIPs. For Inputs and other controls, we expect the rate to be below 10% (generally below 5%). If the accompanying ChIP is below about 20%, we include all the reads when doing a differential analysis. Above that, we do some other assessments of ChIP quality, and may re-run the entire ChIP or ask for another replicate (with enough replicates, the analysis should be tolerant of some PCR duplication bias).

And of course, any binding sites that are identified as differentially bound need to be examined to see whether they are being driven by duplicate reads in one or more replicates.

A quick plug: we have a new package in Bioconductor 2.9 called DiffBind that does exactly this type of differential binding analysis using edgeR and/or DESeq. It may be of use to you.

Cheers-
Rory

Dr. Rory Stark
Sr. Computational Biology Analyst
Cambridge Research Institute - Cancer Research UK
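The duplication-rate check Rory describes could be sketched roughly as below. This is an illustration only, not his group's pipeline or anything from DiffBind. The 10% and 20% cut-offs come from his post. The function names, the return labels, and the choice to flag a high-duplication control before looking at the ChIP are assumptions made for the example.

```python
def duplication_rate(reads):
    """Fraction of reads that repeat an already-seen (chrom, position, strand)."""
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)


def triage(chip_rate, control_rate, chip_max=0.20, control_max=0.10):
    """Rough decision rule; the thresholds follow the post, the labels are hypothetical."""
    if control_rate >= control_max:
        return "inspect control"
    if chip_rate < chip_max:
        return "use all reads"
    return "assess ChIP quality further"


# Hypothetical toy libraries: 10 ChIP reads with one repeat, 20 unique control reads.
chip = [("chr1", i, "+") for i in range(9)] + [("chr1", 0, "+")]
control = [("chr1", 100 + i, "-") for i in range(20)]

chip_rate = duplication_rate(chip)
control_rate = duplication_rate(control)
print(round(chip_rate, 3), round(control_rate, 3), triage(chip_rate, control_rate))
```

A rule like this only decides whether to keep all reads for the differential analysis. It does not replace the follow-up step of inspecting individual differentially bound sites for duplicate-driven signal.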

ADD REPLY • link 12.5 years ago Rory Stark ▴ 100


Thank you for your response, Ivan!

I completely agree that removing duplicates is a necessary step for peak calling. It seems as though keeping duplicates is a double-edged sword, since it is not easy to separate PCR-artifact reads from real ones.

What makes using all the reads seem appealing, or even necessary, to me is that in differential binding you want to see the actual differences in binding intensity (reads). If non-redundant reads are used, isn't the difference in binding intensity lost or apparently reduced? (Maybe this does not matter in the DESeq analysis?)

Thanks again!

Ian

ADD REPLY • link 12.5 years ago Ian Donaldson ▴ 70
