"Normalising" GRanges coverage object

0

Entering edit mode

Aliaksei Holik ▴ 350

@aliaksei-holik-4992

Last seen 10.0 years ago

Spain/Barcelona/Centre for Genomic Regu…

Dear Bioconductors and especially *Ranges team, I would like your expert opinion on whether what I'm doing is a sensible thing to do. I wish to plot some ChIP read coverage data in Gviz. The way I do it is by reading a number of bam files with readGAlignments() and calculating the coverage with the likewise named function. I would also like to normalise the coverage for each bam file for library size and the Input sample. I do so by simply dividing the coverage objects by the respective library size and by the coverage object for the Input sample. It all works beautifully, and the graphs look ok. But I'm left with a nagging feeling, that it's too easy to be true. So I guess my question is: am I right to assume that by dividing one coverage object by another, the coverage at each position in one object would be divided by the coverage at the same very position in the other object. I'm almost expecting something to go wrong, if, for instance, I have 0 coverage at some position in my reference object. But perhaps I should have faith in Bioconductor. Many thanks, Aliaksei.

Coverage GO Gviz Coverage GO Gviz • 3.4k views

ADD COMMENT • link updated 11.5 years ago by Ryan C. Thompson ★ 7.9k • written 11.5 years ago by Aliaksei Holik ▴ 350

1

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 16 months ago

Icahn School of Medicine at Mount Sinai…

Hi Aliaksei, First of all, dividing by library size is not the way to go. It's wrong in ChIP-seq for the same reason it's wrong in RNA-seq: compositional bias. In order to properly normalize for sequencing depth, I would recommend the strategy described here: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4066778/ You just need to look at the small section about calculating normalization factors using 10kb bins. Once you calculate the normalization factors, multiply them by the nominal library sizes to get the normalized library sizes. Then divide the raw coverage by the normalized library sizes to get the normalized coverage, which has been adjusted for both sequencing depth and compositional bias. You should be able to normalize input and ChIP samples together in this way as long as your ChIP peaks are not omnipresent throughout the entire genome. From there, you could theoretically subtract from each sample the corresponding input (or maybe the average of all inputs, since they should all be equivalent) to get enrichment above input, or divide to get fold enrichment over input. You are correct in your supposition that arithmetic operations on Rle objects (i.e. the type of object returned by coverage) work element-wise, just like arithmetic on ordinary R numeric vectors. Of course you are correct that if the input coverage is zero at any point, then dividing by it will result in an infinite ratio, which you probably don't want. You could take another page out of the RNA-seq playbook and add a small constant to each coverage object before dividing to regularize the ratios. You could also smooth out the input coverage by replacing each coverage value with a mean over a window centered at that position. You can do this with the runmean function from the caTools package (from CRAN, not Bioconductor), using options endrule="mean" and align="center". Depending on the sparseness of your coverage, sufficient smoothing might eliminate zero values. Anyway, that's about all I have to suggest. I hope you find something useful in there. -Ryan On Fri Aug 8 10:19:39 2014, Aliaksei Holik wrote: > Dear Bioconductors and especially *Ranges team, > > I would like your expert opinion on whether what I'm doing is a > sensible thing to do. > > I wish to plot some ChIP read coverage data in Gviz. The way I do it > is by reading a number of bam files with readGAlignments() and > calculating the coverage with the likewise named function. I would > also like to normalise the coverage for each bam file for library size > and the Input sample. I do so by simply dividing the coverage objects > by the respective library size and by the coverage object for the > Input sample. It all works beautifully, and the graphs look ok. But > I'm left with a nagging feeling, that it's too easy to be true. So I > guess my question is: am I right to assume that by dividing one > coverage object by another, the coverage at each position in one > object would be divided by the coverage at the same very position in > the other object. I'm almost expecting something to go wrong, if, for > instance, I have 0 coverage at some position in my reference object. > But perhaps I should have faith in Bioconductor. > > Many thanks, > > Aliaksei. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 11.5 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

That's great. Thanks Ryan! On 09/08/2014 04:40, Ryan wrote: > Hi Aliaksei, > > First of all, dividing by library size is not the way to go. It's wrong > in ChIP-seq for the same reason it's wrong in RNA-seq: compositional > bias. In order to properly normalize for sequencing depth, I would > recommend the strategy described here: > http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4066778/ > > You just need to look at the small section about calculating > normalization factors using 10kb bins. Once you calculate the > normalization factors, multiply them by the nominal library sizes to get > the normalized library sizes. Then divide the raw coverage by the > normalized library sizes to get the normalized coverage, which has been > adjusted for both sequencing depth and compositional bias. You should be > able to normalize input and ChIP samples together in this way as long as > your ChIP peaks are not omnipresent throughout the entire genome. > > From there, you could theoretically subtract from each sample the > corresponding input (or maybe the average of all inputs, since they > should all be equivalent) to get enrichment above input, or divide to > get fold enrichment over input. You are correct in your supposition that > arithmetic operations on Rle objects (i.e. the type of object returned > by coverage) work element-wise, just like arithmetic on ordinary R > numeric vectors. > > Of course you are correct that if the input coverage is zero at any > point, then dividing by it will result in an infinite ratio, which you > probably don't want. You could take another page out of the RNA-seq > playbook and add a small constant to each coverage object before > dividing to regularize the ratios. You could also smooth out the input > coverage by replacing each coverage value with a mean over a window > centered at that position. You can do this with the runmean function > from the caTools package (from CRAN, not Bioconductor), using options > endrule="mean" and align="center". Depending on the sparseness of your > coverage, sufficient smoothing might eliminate zero values. > > Anyway, that's about all I have to suggest. I hope you find something > useful in there. > > -Ryan > > On Fri Aug 8 10:19:39 2014, Aliaksei Holik wrote: >> Dear Bioconductors and especially *Ranges team, >> >> I would like your expert opinion on whether what I'm doing is a >> sensible thing to do. >> >> I wish to plot some ChIP read coverage data in Gviz. The way I do it >> is by reading a number of bam files with readGAlignments() and >> calculating the coverage with the likewise named function. I would >> also like to normalise the coverage for each bam file for library size >> and the Input sample. I do so by simply dividing the coverage objects >> by the respective library size and by the coverage object for the >> Input sample. It all works beautifully, and the graphs look ok. But >> I'm left with a nagging feeling, that it's too easy to be true. So I >> guess my question is: am I right to assume that by dividing one >> coverage object by another, the coverage at each position in one >> object would be divided by the coverage at the same very position in >> the other object. I'm almost expecting something to go wrong, if, for >> instance, I have 0 coverage at some position in my reference object. >> But perhaps I should have faith in Bioconductor. >> >> Many thanks, >> >> Aliaksei. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD REPLY • link 11.5 years ago Aliaksei Holik ▴ 350

Login before adding your answer.