DEseq for chip-seq data normalisation

0

Entering edit mode

Giuseppe Gallone ▴ 170

@giuseppe-gallone-6092

Last seen 9.6 years ago

Hi I would like to use DEseq or DEseq2 to normalise the peak signal for some Chip-seq data across 10 biological replicates. I started looking at the DEseq documentation - it seems the program requires a matrix arrangement of raw count data, where each row is a peak and each column is a replicate. What is the best way to obtain this? I have bam files for the reads, obtained with BWA, and bed files (or alternatively narrowPeak files) for the peak intervals, obtained using MACS. I gather it is possible to use a program called HTseq to compute these counts, however this program seems unable to deal with bed files, only with gff files, and I'd prefer working directly with my beds if at all possible. Thank you. Best regards Giuseppe

DESeq DESeq2 DESeq DESeq2 • 3.9k views

ADD COMMENT • link updated 10.4 years ago by Ying W ▴ 90 • written 10.5 years ago by Giuseppe Gallone ▴ 170

1

Entering edit mode

Lucia Peixoto ▴ 330

@lucia-peixoto-4203

Last seen 9.6 years ago

Hi Giuseppe, Unfortunately there is not much available to do stats on ChIPseq data. It is my experience that the data shows exactly the same overdispersion problem that is see in RNAseq so using either EdgeR, DEseq or DEseq2 to analyze ChIPseq data is the way to go. There are a couple of challenges along the way that make this undertaking not quite straightforward.The only bioconductor package that I know tries to tackle this issues is DiffBind, so you can give it a try. One of the main differences is that unlike gene or exon coordinates, peaks in your individual replicates will not be exactly in the same place, if you are working with TF data this will not be too bad, but anything nucleosome associated will have considerable phase shift from replicate to replicate. So you first have to do some sort of merging of reproducible peaks into regions. I do not recommend doing the peak calling with the pooled data.After doing several ChIP-seq experiments with replicates I have observed that a lot of peaks, even ones with high z-scores/low p-values, do not show up in more than one replicate (but maybe this is particular to my type of experiments). Merging all the peaks leads to a high number of false positives. So you need to integrate the peak locations into a single file but make sure you have a minimum number of carriers for each peak, I usually do presence in at least 2 of the replicates. You can make a gff file that you can feed into HTSeq in which you define the reproducible peak regions on your samples as if it was the gff with the gene models, but making this file takes a little bit of work. We are currently preparing a package for CRAN submission to specifically integrate the analysis of ChIP-seq data with replicates to EdgeR and DESeq, addressing most of what I mentioned above and including a peak caller for ease of flow of the analysis.I cannot finish the submission until the accompanying biological paper is out, so it won't be available until next year. hope this was helpful best Lucia On Mon, Nov 4, 2013 at 8:47 AM, Giuseppe Gallone < giuseppe.gallone@dpag.ox.ac.uk> wrote: > Hi > > I would like to use DEseq or DEseq2 to normalise the peak signal for some > Chip-seq data across 10 biological replicates. > > I started looking at the DEseq documentation - it seems the program > requires a matrix arrangement of raw count data, where each row is a peak > and each column is a replicate. > > What is the best way to obtain this? I have bam files for the reads, > obtained with BWA, and bed files (or alternatively narrowPeak files) for > the peak intervals, obtained using MACS. > > I gather it is possible to use a program called HTseq to compute these > counts, however this program seems unable to deal with bed files, only with > gff files, and I'd prefer working directly with my beds if at all possible. > Thank you. > > Best regards > Giuseppe > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane. > science.biology.informatics.conductor > -- Lucia Peixoto PhD Postdoctoral Research Fellow Laboratory of Dr. Ted Abel Department of Biology School of Arts and Sciences University of Pennsylvania "Think boldly, don't be afraid of making mistakes, don't miss small details, keep your eyes open, and be modest in everything except your aims." Albert Szent-Gyorgyi [[alternative HTML version deleted]]

ADD COMMENT • link 10.5 years ago Lucia Peixoto ▴ 330

0

Entering edit mode

Hello Lucia this is all great info! Thank you very much for taking the time to share your findings. I am indeed using diffbind, and found some interesting results, however now I'd need to access directly my downsampled mapped reads. With diffbind, I was basically throwing in my mapped reads and my peak intervals and getting a differential analysis out (after setting up a contrast). Now I'd like to try something slightly different and need to work with normalised bams of my samples. The problem I have is that my 10 bam have wildly varying numbers of mapping reads. I would like to downsample them all to a minimum common before examining quantitative differences in the peak signals across them. I was hoping I could do this with DEseq: feed it some bams and obtain normalised versions of them. But I understand this is not possible? I guess I will try to downsample my bams by myself using for example picard and then take it from there. Are there maybe some alternatives you'd suggest? I know MACS also allows to downsample bam. Thanks! Giuseppe On 11/04/13 18:13, Lucia Peixoto wrote: > Hi Giuseppe, > Unfortunately there is not much available to do stats on ChIPseq data. > It is my experience that the data shows exactly the same overdispersion > problem that is see in RNAseq so using either EdgeR, DEseq or DEseq2 to > analyze ChIPseq data is the way to go. There are a couple of challenges > along the way that make this undertaking not quite straightforward.The > only bioconductor package that I know tries to tackle this issues is > DiffBind, so you can give it a try. > > One of the main differences is that unlike gene or exon coordinates, > peaks in your individual replicates will not be exactly in the same > place, if you are working with TF data this will not be too bad, but > anything nucleosome associated will have considerable phase shift from > replicate to replicate. So you first have to do some sort of merging of > reproducible peaks into regions. > > I do not recommend doing the peak calling with the pooled data.After > doing several ChIP-seq experiments with replicates I have observed that > a lot of peaks, even ones with high z-scores/low p-values, do not show > up in more than one replicate (but maybe this is particular to my type > of experiments). Merging all the peaks leads to a high number of false > positives. So you need to integrate the peak locations into a single > file but make sure you have a minimum number of carriers for each peak, > I usually do presence in at least 2 of the replicates. > You can make a gff file that you can feed into HTSeq in which you define > the reproducible peak regions on your samples as if it was the gff with > the gene models, but making this file takes a little bit of work. > We are currently preparing a package for CRAN submission to specifically > integrate the analysis of ChIP-seq data with replicates to EdgeR and > DESeq, addressing most of what I mentioned above and including a peak > caller for ease of flow of the analysis.I cannot finish the submission > until the accompanying biological paper is out, so it won't be available > until next year. > > hope this was helpful > best > > Lucia > > > > On Mon, Nov 4, 2013 at 8:47 AM, Giuseppe Gallone > <giuseppe.gallone at="" dpag.ox.ac.uk="" <mailto:giuseppe.gallone="" at="" dpag.ox.ac.uk="">> > wrote: > > Hi > > I would like to use DEseq or DEseq2 to normalise the peak signal for > some Chip-seq data across 10 biological replicates. > > I started looking at the DEseq documentation - it seems the program > requires a matrix arrangement of raw count data, where each row is a > peak and each column is a replicate. > > What is the best way to obtain this? I have bam files for the reads, > obtained with BWA, and bed files (or alternatively narrowPeak files) > for the peak intervals, obtained using MACS. > > I gather it is possible to use a program called HTseq to compute > these counts, however this program seems unable to deal with bed > files, only with gff files, and I'd prefer working directly with my > beds if at all possible. Thank you. > > Best regards > Giuseppe > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > > -- > Lucia Peixoto PhD > Postdoctoral Research Fellow > Laboratory of Dr. Ted Abel > Department of Biology > School of Arts and Sciences > University of Pennsylvania > > "Think boldly, don't be afraid of making mistakes, don't miss small > details, keep your eyes open, and be modest in everything except your > aims." > Albert Szent-Gyorgyi

ADD REPLY • link 10.4 years ago Giuseppe Gallone ▴ 170

0

Entering edit mode

hi Giuseppe, DiffBind calls either edgeR, DESeq or DESeq2 during the processing pipeline. >From reading the DiffBind 1.8.2 vignette, I see on page 37: The final results (as a DESeqDataSet) are accessible within the DBA object > as > DBA$contrasts[[n]]$DESeq2$DEdata > and may be examined and manipulated directly for further customization. So then you can get DESeq2 normalized counts at this point with: dds <- DBA$contrasts[[n]]$DESeq2$DEdata counts(dds, normalized=TRUE) Mike On Tue, Nov 5, 2013 at 7:57 AM, Giuseppe Gallone < giuseppe.gallone@dpag.ox.ac.uk> wrote: > Hello Lucia > > this is all great info! Thank you very much for taking the time to share > your findings. I am indeed using diffbind, and found some interesting > results, however now I'd need to access directly my downsampled mapped > reads. > > With diffbind, I was basically throwing in my mapped reads and my peak > intervals and getting a differential analysis out (after setting up a > contrast). > > Now I'd like to try something slightly different and need to work with > normalised bams of my samples. The problem I have is that my 10 bam have > wildly varying numbers of mapping reads. I would like to downsample them > all to a minimum common before examining quantitative differences in the > peak signals across them. > > I was hoping I could do this with DEseq: feed it some bams and obtain > normalised versions of them. But I understand this is not possible? > > I guess I will try to downsample my bams by myself using for example > picard and then take it from there. Are there maybe some alternatives you'd > suggest? I know MACS also allows to downsample bam. Thanks! > > Giuseppe > > > On 11/04/13 18:13, Lucia Peixoto wrote: > >> Hi Giuseppe, >> Unfortunately there is not much available to do stats on ChIPseq data. >> It is my experience that the data shows exactly the same overdispersion >> problem that is see in RNAseq so using either EdgeR, DEseq or DEseq2 to >> analyze ChIPseq data is the way to go. There are a couple of challenges >> along the way that make this undertaking not quite straightforward.The >> only bioconductor package that I know tries to tackle this issues is >> DiffBind, so you can give it a try. >> >> One of the main differences is that unlike gene or exon coordinates, >> peaks in your individual replicates will not be exactly in the same >> place, if you are working with TF data this will not be too bad, but >> anything nucleosome associated will have considerable phase shift from >> replicate to replicate. So you first have to do some sort of merging of >> reproducible peaks into regions. >> >> I do not recommend doing the peak calling with the pooled data.After >> doing several ChIP-seq experiments with replicates I have observed that >> a lot of peaks, even ones with high z-scores/low p-values, do not show >> up in more than one replicate (but maybe this is particular to my type >> of experiments). Merging all the peaks leads to a high number of false >> positives. So you need to integrate the peak locations into a single >> file but make sure you have a minimum number of carriers for each peak, >> I usually do presence in at least 2 of the replicates. >> You can make a gff file that you can feed into HTSeq in which you define >> the reproducible peak regions on your samples as if it was the gff with >> the gene models, but making this file takes a little bit of work. >> We are currently preparing a package for CRAN submission to specifically >> integrate the analysis of ChIP-seq data with replicates to EdgeR and >> DESeq, addressing most of what I mentioned above and including a peak >> caller for ease of flow of the analysis.I cannot finish the submission >> until the accompanying biological paper is out, so it won't be available >> until next year. >> >> hope this was helpful >> best >> >> Lucia >> >> >> >> On Mon, Nov 4, 2013 at 8:47 AM, Giuseppe Gallone >> <giuseppe.gallone@dpag.ox.ac.uk <mailto:giuseppe.gallone@dpag.ox.ac.uk="">> >> >> wrote: >> >> Hi >> >> I would like to use DEseq or DEseq2 to normalise the peak signal for >> some Chip-seq data across 10 biological replicates. >> >> I started looking at the DEseq documentation - it seems the program >> requires a matrix arrangement of raw count data, where each row is a >> peak and each column is a replicate. >> >> What is the best way to obtain this? I have bam files for the reads, >> obtained with BWA, and bed files (or alternatively narrowPeak files) >> for the peak intervals, obtained using MACS. >> >> I gather it is possible to use a program called HTseq to compute >> these counts, however this program seems unable to deal with bed >> files, only with gff files, and I'd prefer working directly with my >> beds if at all possible. Thank you. >> >> Best regards >> Giuseppe >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.__science.biology.informatics.__conductor< >> http://news.gmane.org/gmane.science.biology.informatics.conductor> >> >> >> >> >> >> -- >> Lucia Peixoto PhD >> Postdoctoral Research Fellow >> Laboratory of Dr. Ted Abel >> Department of Biology >> School of Arts and Sciences >> University of Pennsylvania >> >> "Think boldly, don't be afraid of making mistakes, don't miss small >> details, keep your eyes open, and be modest in everything except your >> aims." >> Albert Szent-Gyorgyi >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane. > science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 10.5 years ago Michael Love 41k

0

Entering edit mode

Hi Mike thanks a lot for this, exactly what I needed. Best Giuseppe On 05/11/2013 13:54, Michael Love wrote: > hi Giuseppe, > > DiffBind calls either edgeR, DESeq or DESeq2 during the processing pipeline. > > From reading the DiffBind 1.8.2 vignette, I see on page 37: > > The final results (as a DESeqDataSet) are accessible within the DBA > object as > DBA$contrasts[[n]]$DESeq2$DEdata > and may be examined and manipulated directly for further customization. > > > So then you can get DESeq2 normalized counts at this point with: > > dds <- DBA$contrasts[[n]]$DESeq2$DEdata > counts(dds, normalized=TRUE) > > Mike

ADD REPLY • link 10.4 years ago Giuseppe Gallone ▴ 170

0

Entering edit mode

Dear Lucia, On 05/nov/2013, at 13:57, Giuseppe Gallone <giuseppe.gallone at="" dpag.ox.ac.uk=""> wrote: >> I do not recommend doing the peak calling with the pooled data.After >> doing several ChIP-seq experiments with replicates I have observed that >> a lot of peaks, even ones with high z-scores/low p-values, do not show >> up in more than one replicate (but maybe this is particular to my type >> of experiments). Merging all the peaks leads to a high number of false >> positives. So you need to integrate the peak locations into a single >> file but make sure you have a minimum number of carriers for each peak, >> I usually do presence in at least 2 of the replicates. This sounds reasonable and, indeed, I have a similar anedoctical experience. Do you have some data/benchmarks supporting this claim? Thanks, d /* Davide Cittaro, PhD Coordinator of Bioinformatics Core Center for Translational Genomics and Bioinformatics Ospedale San Raffaele Via Olgettina 58 20132 Milano Italy Office: +39 02 26439211 Mail: cittaro.davide at hsr.it Skype: daweonline */

ADD REPLY • link 10.4 years ago Cittaro Davide ▴ 240

0

Entering edit mode

Hi, I always think I should use IDR from ENCODE to get some <conservative> peaks from replicates, then move to Diffbind/DESeq. How do you think? PS: IDR: https://sites.google.com/site/anshulkundaje/projects/idr Regards, Sheng On Wed, Nov 6, 2013 at 10:18 AM, Davide Cittaro <cittaro.davide@hsr.it>wrote: > Dear Lucia, > > On 05/nov/2013, at 13:57, Giuseppe Gallone <giuseppe.gallone@dpag.ox.ac.uk> > wrote: > > >> I do not recommend doing the peak calling with the pooled data.After > >> doing several ChIP-seq experiments with replicates I have observed that > >> a lot of peaks, even ones with high z-scores/low p-values, do not show > >> up in more than one replicate (but maybe this is particular to my type > >> of experiments). Merging all the peaks leads to a high number of false > >> positives. So you need to integrate the peak locations into a single > >> file but make sure you have a minimum number of carriers for each peak, > >> I usually do presence in at least 2 of the replicates. > > This sounds reasonable and, indeed, I have a similar anedoctical > experience. Do you have some data/benchmarks supporting this claim? > Thanks, > > d > > > /* > Davide Cittaro, PhD > > Coordinator of Bioinformatics Core > Center for Translational Genomics and Bioinformatics > Ospedale San Raffaele > Via Olgettina 58 > 20132 Milano > Italy > > Office: +39 02 26439211 > Mail: cittaro.davide@hsr.it > Skype: daweonline > */ > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 10.4 years ago sheng zhao ▴ 90

0

Entering edit mode

Hello Sheng I have had biologically interesting results when selecting overlapping peaks between replicates using the IDR. I first call peaks with MACS2 or SPP, using a very loose q-value threshold, and then pass these to the IDR. However, in my experience this works fine between pairs of aligned read files. If you have many samples, my experience reflects Lucia's: plotting an overlap function and then, based on the overlap behaviour, choosing a minimum number of carriers for each peak position (in my case 2-3) works well to create a consensus peakset. Regards Giuseppe On 11/06/13 09:40, pbczyd . wrote: > Hi, > I always think I should use IDR from ENCODE to get some <conservative> > peaks from replicates, then move to Diffbind/DESeq. > > How do you think? > > PS: IDR: https://sites.google.com/site/anshulkundaje/projects/idr > > Regards, > Sheng > -- Giuseppe Gallone, PhD MRC career development fellow MRC Functional Genomics Unit - DPAG University of Oxford, UK

ADD REPLY • link 10.4 years ago Giuseppe Gallone ▴ 170

0

Entering edit mode

Rory Stark ★ 5.1k

@rory-stark-5741

Last seen 4 days ago

Cambridge, UK

Hi Guiseppe- You can retrieve the complete matrix of read counts from DiffBind, either normalized or not, using dba.peakset with bRetrieve=TRUE. To can set the score to use via dba.count with peaks=NULL and score=DBA_SCORE_READS, or any of the other possible score values. The default score is DBA_SCORE_TMM_MINUS_FULL, which is normalized using edgeR's TMM method, after subtracting the reads in the control, and using the full library size (not just the reads in peaks) as a scalar. Cheers- Rory On Mon, Nov 4, 2013 at 8:47 AM, Giuseppe Gallone < giuseppe.gallone at dpag.ox.ac.uk> wrote: >Hi > >I would like to use DEseq or DEseq2 to normalise the peak signal for some >Chip-seq data across 10 biological replicates. > >I started looking at the DEseq documentation - it seems the program >requires a matrix arrangement of raw count data, where each row is a peak >and each column is a replicate. > >What is the best way to obtain this? I have bam files for the reads, >obtained with BWA, and bed files (or alternatively narrowPeak files) for >the peak intervals, obtained using MACS. > >I gather it is possible to use a program called HTseq to compute these >counts, however this program seems unable to deal with bed files, only >with >gff files, and I'd prefer working directly with my beds if at all >possible. >Thank you. > >Best regards >Giuseppe > >_______________________________________________ >Bioconductor mailing list >Bioconductor at r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: http://news.gmane.org/gmane. >science.biology.informatics.conductor

ADD COMMENT • link 10.4 years ago Rory Stark ★ 5.1k

0

Entering edit mode

Thanks a lot Rory. Do you think it would then make sense to use the normalised counts in the peaks to build a ratio based on the raw count and then feed this ratio to, say, picard to get a downsampled bam from the original bam? Best Giuseppe On 11/05/13 18:18, Rory Stark wrote: > Hi Guiseppe- > > You can retrieve the complete matrix of read counts from DiffBind, either > normalized or not, using dba.peakset with bRetrieve=TRUE. To can set the > score to use via dba.count with peaks=NULL and score=DBA_SCORE_READS, or > any of the other possible score values. The default score is > DBA_SCORE_TMM_MINUS_FULL, which is normalized using edgeR's TMM method, > after subtracting the reads in the control, and using the full library > size (not just the reads in peaks) as a scalar. > > Cheers- > Rory >

ADD REPLY • link 10.4 years ago Giuseppe Gallone ▴ 170

0

Entering edit mode

Ying W ▴ 90

@ying-w-4341

Last seen 8.7 years ago

United States

Hi Rory, Could you give some insight into why TMM is used with full library size, it seems to make sense for effective library size case but where full library size is used, would it still be valid to try and minimize changes between conditions? Best, -Ying On 11/05/13 18:18, Rory Stark wrote: > Hi Guiseppe- > > You can retrieve the complete matrix of read counts from DiffBind, either > normalized or not, using dba.peakset with bRetrieve=TRUE. To can set the > score to use via dba.count with peaks=NULL and score=DBA_SCORE_READS, or > any of the other possible score values. The default score is > DBA_SCORE_TMM_MINUS_FULL, which is normalized using edgeR's TMM method, > after subtracting the reads in the control, and using the full library > size (not just the reads in peaks) as a scalar. > > Cheers- > Rory >

ADD COMMENT • link 10.4 years ago Ying W ▴ 90

0

Entering edit mode

Hi Ying- We actually just changed the default normalization from effective to full library size in the most recent release. The reason is that while effective is frequently a better choice, it is based on the assumption that overall binding levels in all the samples is similar. When this assumption is incorrect, it can result in substantially incorrect results; using full library size when effective applies results in less catastrophically wrong answers. I will definitely be changing normalization to use effective sizes when it is the right thing to do, but I have become aware that many (most?) DiffBind users don't change the defaults, so we determined that a more conservative default was preferable. I'm not sure what you're asking regarding "try and minimize changes between conditions" in this context? Cheers- Rory On 06/11/2013 19:08, "Ying Wu" <daiyingw at="" gmail.com=""> wrote: >Hi Rory, >Could you give some insight into why TMM is used with full library size, >it seems to make sense for effective library size case but where full >library size is used, would it still be valid to try and minimize >changes between conditions? > >Best, >-Ying > >On 11/05/13 18:18, Rory Stark wrote: > > Hi Guiseppe- > > > > You can retrieve the complete matrix of read counts from DiffBind, >either > > normalized or not, using dba.peakset with bRetrieve=TRUE. To can set >the > > score to use via dba.count with peaks=NULL and score=DBA_SCORE_READS, >or > > any of the other possible score values. The default score is > > DBA_SCORE_TMM_MINUS_FULL, which is normalized using edgeR's TMM method, > > after subtracting the reads in the control, and using the full library > > size (not just the reads in peaks) as a scalar. > > > > Cheers- > > Rory > > > > >

ADD REPLY • link 10.4 years ago Rory Stark ★ 5.1k

0

Entering edit mode

Hi Rory, From my understanding, TMM normalization will try to minimize the fold changes between conditions (assumes that most regions are not differentially expressed) and works best with about equal number of up and down regulated genes (For the second part see some of the work done by Kadota K. on extending TMM http://pubmed.gov/22475125) Using full library size, one no longer assumes that overall binding levels in all samples are similar. If the motivation behind using full library size is that overall binding levels are very different, wouldn't most regions then be differentially bound and thus TMM's assumption that most regions are unchanged also be invalid? Best, Ying On 11/6/2013 11:16 AM, Rory Stark wrote: > Hi Ying- > > We actually just changed the default normalization from effective to full > library size in the most recent release. The reason is that while > effective is frequently a better choice, it is based on the assumption > that overall binding levels in all the samples is similar. When this > assumption is incorrect, it can result in substantially incorrect results; > using full library size when effective applies results in less > catastrophically wrong answers. > > I will definitely be changing normalization to use effective sizes when it > is the right thing to do, but I have become aware that many (most?) > DiffBind users don't change the defaults, so we determined that a more > conservative default was preferable. > > I'm not sure what you're asking regarding "try and minimize changes > between conditions" in this context? > > Cheers- > Rory > > On 06/11/2013 19:08, "Ying Wu" <daiyingw at="" gmail.com=""> wrote: > >> Hi Rory, >> Could you give some insight into why TMM is used with full library size, >> it seems to make sense for effective library size case but where full >> library size is used, would it still be valid to try and minimize >> changes between conditions? >> >> Best, >> -Ying >> >> On 11/05/13 18:18, Rory Stark wrote: >>> Hi Guiseppe- >>> >>> You can retrieve the complete matrix of read counts from DiffBind, >> either >>> normalized or not, using dba.peakset with bRetrieve=TRUE. To can set >> the >>> score to use via dba.count with peaks=NULL and score=DBA_SCORE_READS, >> or >>> any of the other possible score values. The default score is >>> DBA_SCORE_TMM_MINUS_FULL, which is normalized using edgeR's TMM method, >>> after subtracting the reads in the control, and using the full library >>> size (not just the reads in peaks) as a scalar. >>> >>> Cheers- >>> Rory >>> >> >>

ADD REPLY • link 10.4 years ago Ying W ▴ 90

0

Entering edit mode

Hi, It is true that TMM and similar methods are based on the assumption that most binding levels are similar across samples, and this assumption may not be true. However, Using raw library sizes (i.e. sum of all counts in a sample) also imposes an assumption that is at least as bad, which is that the total amount of binding in each sample is constant. In other words, using raw library size is equivalent to assuming that there is a fixed amount of binding that is allocated differently across the genome in each sample. In my opinion, if I had to choose one of those two assumptions, I would almost always choose the former and go with TMM. -Ryan On 11/6/13, 11:16 AM, Rory Stark wrote: > Hi Ying- > > We actually just changed the default normalization from effective to full > library size in the most recent release. The reason is that while > effective is frequently a better choice, it is based on the assumption > that overall binding levels in all the samples is similar. When this > assumption is incorrect, it can result in substantially incorrect results; > using full library size when effective applies results in less > catastrophically wrong answers. > > I will definitely be changing normalization to use effective sizes when it > is the right thing to do, but I have become aware that many (most?) > DiffBind users don't change the defaults, so we determined that a more > conservative default was preferable. > > I'm not sure what you're asking regarding "try and minimize changes > between conditions" in this context? > > Cheers- > Rory > > On 06/11/2013 19:08, "Ying Wu" <daiyingw at="" gmail.com=""> wrote: > >> Hi Rory, >> Could you give some insight into why TMM is used with full library size, >> it seems to make sense for effective library size case but where full >> library size is used, would it still be valid to try and minimize >> changes between conditions? >> >> Best, >> -Ying >> >> On 11/05/13 18:18, Rory Stark wrote: >>> Hi Guiseppe- >>> >>> You can retrieve the complete matrix of read counts from DiffBind, >> either >>> normalized or not, using dba.peakset with bRetrieve=TRUE. To can set >> the >>> score to use via dba.count with peaks=NULL and score=DBA_SCORE_READS, >> or >>> any of the other possible score values. The default score is >>> DBA_SCORE_TMM_MINUS_FULL, which is normalized using edgeR's TMM method, >>> after subtracting the reads in the control, and using the full library >>> size (not just the reads in peaks) as a scalar. >>> >>> Cheers- >>> Rory >>> >> >> > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 10.4 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Rory Stark ★ 5.1k

@rory-stark-5741

Last seen 4 days ago

Cambridge, UK

Hi Guiseppe- I'm not sure why you want to downsample? The normalization is supposed to take care of differences in read depth and distribution of read densities amongst peaks. edgeR/DESeq/DESeq2 do calculate overall scaling factors for each library as part of the normalization computation, so it may be useful to retrieve those to see how each library is being weighted. This would be better than basically reverse-enginering it by calculating the ratios between the original values and the normalized ones. Cheers- R >Date: Wed, 06 Nov 2013 10:48:56 +0000 >From: Giuseppe Gallone <giuseppe.gallone at="" dpag.ox.ac.uk=""> >Thanks a lot Rory. Do you think it would then make sense to use the >normalised counts in the peaks to build a ratio based on the raw count >and then feed this ratio to, say, picard to get a downsampled bam from >the original bam? > >Best >Giuseppe

ADD COMMENT • link 10.4 years ago Rory Stark ★ 5.1k

0

Entering edit mode

Hi Rory I suppose the analysis I'd like to carry out is conceptually a simpler one than those I was looking at using Diffbind. I'd like to look at the existence-vs-full depletion of peaks at each interval across samples - as opposed to continuous quantitative differences in the peak signal at each interval across samples. Given 10 samples, if I take the consensus peakset generated by DiffBind with, say minOverlap=8, I'd like to look at each of those peak positions and find sample pairs where a peak is there for sample_1 but not there for sample_2. Then a question could be: are there genetic differences in the motif under this peak which might cause the complete depletion? If I had technical replicates, I image I'd be able to use diffbind as follows -set contrast: (sample_1r1, ..., sample_1ri) VS (sample_2r1, ..., sample_2ri) -dba_analyse (sample1 vs sample2) -follow up peaks sites with complete depletion VS signal However I do not have technical replicates and if I understand correctly I cannot use diffbind/edgeR in this case. So my idea is to simple select peak positions with large overlap across sample after bam subsampling to a minimum common value, and then look at (signal/nosignal) pairs for those peak positions manually. Would you recommend to use the overall scaling factors for each library in this case? Cheers G On 11/06/13 11:32, Rory Stark wrote: > Hi Guiseppe- > > I'm not sure why you want to downsample? The normalization is supposed to > take care of differences in read depth and distribution of read densities > amongst peaks. > > edgeR/DESeq/DESeq2 do calculate overall scaling factors for each library > as part of the normalization computation, so it may be useful to retrieve > those to see how each library is being weighted. This would be better than > basically reverse-enginering it by calculating the ratios between the > original values and the normalized ones. > > Cheers- > R > >> Date: Wed, 06 Nov 2013 10:48:56 +0000 >> From: Giuseppe Gallone <giuseppe.gallone at="" dpag.ox.ac.uk="">> >> Thanks a lot Rory. Do you think it would then make sense to use the >> normalised counts in the peaks to build a ratio based on the raw count >> and then feed this ratio to, say, picard to get a downsampled bam from >> the original bam? >> >> Best >> Giuseppe > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 10.4 years ago Giuseppe Gallone ▴ 170

Login before adding your answer.