summarizeOverlaps vs countOverlaps

0

Entering edit mode

swaraj basu ▴ 50

@swaraj-basu-4629

Last seen 10.5 years ago

Dear Valerie, Thanks a lot for your explanation. This makes it a lot clear to me. My problem is if I have to get RPKM value from read count then what should I explicitly trust, summarizeOverlaps or countOverlaps. What I have understood is, to work on gene RPKM I should rely upon summarizeOverlaps while to work on RPKM values of transcripts I should consider countOverlaps. Do let me know if I am correct in my assumption. regards Swaraj On Wed, Feb 29, 2012 at 9:07 PM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > Hello, > > You should expect different results from countOverlaps and > summarizeOverlaps because they count in different ways. > > The different modes of summarizeOverlaps counting are explained in the > vignette, "Overview of summarizeOverlaps", found by > > browseVignettes("**GenomicRanges") or at > > http://bioconductor.org/**packages/2.10/bioc/html/**GenomicRanges.ht ml<http: bioconductor.org="" packages="" 2.10="" bioc="" html="" genomicranges.html=""> > > summarizeOverlaps was developed to provide count methods that do not > "double count" a read that hit more than one feature (where a feature is an > exon, transcript etc.). summarizeOverlaps also treats a GRanges and > GRangesList differently. This is explained in the man page and in the > vignette. Once you've had a chance to read the documentation let me know if > this is not clear and I'd be happy to explain with examples. > > Note that the GappedAlignments container is gap-aware. When a > GappedAlignment is coerced to a GRangesList the portions of the gapped read > become distinct rows in the new GRanges obejct. When a GappedAlignment is > coerced to a GRanges the gaps are collapsed and the range is continuous. In > other words we are no longer aware of our gaps. > > ## Here we create the three objects with the same reads, > > galn <- GappedAlignments( > names = c("a","b","c"), > rname = Rle(rep("chr1", 3)), > pos = as.integer(c(1400, 3100, 5200)), > cigar = c("500M", "50M200N50M", "50M150N50M"), > strand = strand(rep("+", 3))) > gr <- as(galn, "GRanges") > grl <- as(galn, "GRangesList") > > ## Create a subject for overlap testing that falls in the gap of the > second range > > > subj <- GRanges("chr1", IRanges(3200, 3300)) > > subj > GRanges with 1 range and 0 elementMetadata cols: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [3200, 3300] * > --- > seqlengths: > chr1 > NA > > ## Investigate how this range is counted > > > countOverlaps(galn, subj) > a b c > 0 0 0 > > countOverlaps(gr, subj) > a b c > 0 1 0 > > countOverlaps(grl, subj) > a b c > 0 0 0 > > > Valerie > > > On 02/29/12 10:24, swaraj basu wrote: > >> Dear All, >> >> I am getting completely different results for the count of mapped reads on >> transcript features using the countOverlaps and summarizeOverlaps. >> >> CountOverlaps on using a gappedAlignment and Granges object gives slightly >> different results. SummarizeOverlaps gives totally different results for >> the modes "Union" and "intersectionNotEmpty". >> >> Can someone please let me know where I am making a mistake. I have >> attached >> a screenshot of the transcripts. All the transcripts belong to the same >> gene. >> >> Following steps were performed by me on a small subset of my data >> >> 1. I built a gapped alignment object of mapped reads using the command >> reads=readBamGappedAlignments(**"path / to / bam/sorted.bam") >> >> 2. From the gapped alignment I built a genomic ranges object >> ranges=GRanges(seqnames=rname(**reads), >> ranges=IRanges(start=start(**reads), >> end=end(reads)),strand=rep("*"**,length(reads))) >> >> 3. I built a genomic ranges list from a custom genedb object (4 >> transcripts). >> geneE=exonsBy(genedb,'gene') >> >> 4. Used the count and summarize overlaps functions >> a). count1=countOverlaps(geneE, reads) >> b). count2=countOverlaps(geneE, ranges) >> c). count3=assays(**summarizeOverlaps(geneE, reads, >> mode="UNION"))$counts[,1] >> c). count4=assays(**summarizeOverlaps(geneE, reads, >> mode="intersectionNotEmpty"))$**counts[,1] >> >> 5. I get four different count results >> "T1, T2, T3, T4" >> a). count1 "55972, 41029, 18270, 18734" >> b). count2 "55989, 41046, 18287, 18751" >> c). count3 "0, 2, 0, 1092" >> d). count4 "0, 3712, 0, 2550" >> >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > -- Swaraj Basu PhD Student (Bioinformatics - Functional Genomics) Animal Physiology and Evolution Stazione Zoologica Anton Dohrn Naples [[alternative HTML version deleted]]

Alignment Alignment • 2.3k views

ADD COMMENT • link updated 13.0 years ago by Valerie Obenchain ★ 6.8k • written 13.0 years ago by swaraj basu ▴ 50

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.1 years ago

United States

Hi Swaraj, Both countOverlaps and summarizeOverlaps identify overlaps between the 'query' and 'subject' or 'features' and 'reads' arguments. Neither method is specific for exons, transcripts, genes etc. They both can be used for any ranges the user defines. If we supply gene ranges, the results are overlaps or counts for the genes. ranges <- some gene ranges summarizeOverlaps(ranges, reads) countOverlaps(ranges, reads) If we define transcripts ranges, the results are counts for the transcripts. ranges <- some tx ranges summarizeOverlaps(ranges, reads) countOverlaps(ranges, reads) The difference in these methods is in how you want to count. countOverlaps will return a 'hit' for each query the subject overlaps. This can result in more hits than reads if a read hits multiple subjects. summarizeOverlaps provides 3 modes of counting all of which count a read only once. The 3 modes have different logic for deciding which subject the read is assigned to. On the man page for summarizeOverlaps (?summarizeOverlaps) I have an example that compares the counts produced by summarizeOverlaps to those produced by countOverlaps. People have different opinions about whether or not reads should be 'double counted'. This is a personal preference and you'll have to decide how you want to count your reads given your biological question. Valerie On 03/01/2012 01:09 AM, swaraj basu wrote: > Dear Valerie, > > Thanks a lot for your explanation. This makes it a lot clear to me. My > problem is if I have to get RPKM value from read count then what > should I explicitly trust, summarizeOverlaps or countOverlaps. What I > have understood is, to work on gene RPKM I should rely upon > summarizeOverlaps while to work on RPKM values of transcripts I should > consider countOverlaps. Do let me know if I am correct in my assumption. > > regards > > Swaraj > > On Wed, Feb 29, 2012 at 9:07 PM, Valerie Obenchain <vobencha@fhcrc.org> <mailto:vobencha@fhcrc.org>> wrote: > > Hello, > > You should expect different results from countOverlaps and > summarizeOverlaps because they count in different ways. > > The different modes of summarizeOverlaps counting are explained in > the vignette, "Overview of summarizeOverlaps", found by > > browseVignettes("GenomicRanges") or at > > http://bioconductor.org/packages/2.10/bioc/html/GenomicRanges.html > > summarizeOverlaps was developed to provide count methods that do > not "double count" a read that hit more than one feature (where a > feature is an exon, transcript etc.). summarizeOverlaps also > treats a GRanges and GRangesList differently. This is explained in > the man page and in the vignette. Once you've had a chance to read > the documentation let me know if this is not clear and I'd be > happy to explain with examples. > > Note that the GappedAlignments container is gap-aware. When a > GappedAlignment is coerced to a GRangesList the portions of the > gapped read become distinct rows in the new GRanges obejct. When a > GappedAlignment is coerced to a GRanges the gaps are collapsed and > the range is continuous. In other words we are no longer aware of > our gaps. > > ## Here we create the three objects with the same reads, > > galn <- GappedAlignments( > names = c("a","b","c"), > rname = Rle(rep("chr1", 3)), > pos = as.integer(c(1400, 3100, 5200)), > cigar = c("500M", "50M200N50M", "50M150N50M"), > strand = strand(rep("+", 3))) > gr <- as(galn, "GRanges") > grl <- as(galn, "GRangesList") > > ## Create a subject for overlap testing that falls in the gap of > the second range > > > subj <- GRanges("chr1", IRanges(3200, 3300)) > > subj > GRanges with 1 range and 0 elementMetadata cols: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [3200, 3300] * > --- > seqlengths: > chr1 > NA > > ## Investigate how this range is counted > > > countOverlaps(galn, subj) > a b c > 0 0 0 > > countOverlaps(gr, subj) > a b c > 0 1 0 > > countOverlaps(grl, subj) > a b c > 0 0 0 > > > Valerie > > > On 02/29/12 10:24, swaraj basu wrote: > > Dear All, > > I am getting completely different results for the count of > mapped reads on > transcript features using the countOverlaps and summarizeOverlaps. > > CountOverlaps on using a gappedAlignment and Granges object > gives slightly > different results. SummarizeOverlaps gives totally different > results for > the modes "Union" and "intersectionNotEmpty". > > Can someone please let me know where I am making a mistake. I > have attached > a screenshot of the transcripts. All the transcripts belong to > the same > gene. > > Following steps were performed by me on a small subset of my data > > 1. I built a gapped alignment object of mapped reads using the > command > reads=readBamGappedAlignments("path / to / bam/sorted.bam") > > 2. From the gapped alignment I built a genomic ranges object > ranges=GRanges(seqnames=rname(reads), > ranges=IRanges(start=start(reads), > end=end(reads)),strand=rep("*",length(reads))) > > 3. I built a genomic ranges list from a custom genedb object (4 > transcripts). > geneE=exonsBy(genedb,'gene') > > 4. Used the count and summarize overlaps functions > a). count1=countOverlaps(geneE, reads) > b). count2=countOverlaps(geneE, ranges) > c). count3=assays(summarizeOverlaps(geneE, reads, > mode="UNION"))$counts[,1] > c). count4=assays(summarizeOverlaps(geneE, reads, > mode="intersectionNotEmpty"))$counts[,1] > > 5. I get four different count results > "T1, T2, T3, T4" > a). count1 "55972, 41029, 18270, 18734" > b). count2 "55989, 41046, 18287, 18751" > c). count3 "0, 2, 0, 1092" > d). count4 "0, 3712, 0, 2550" > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org <mailto:bioconductor@r-project.org> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > -- > Swaraj Basu > PhD Student (Bioinformatics - Functional Genomics) > Animal Physiology and Evolution > Stazione Zoologica Anton Dohrn > Naples > [[alternative HTML version deleted]]

ADD COMMENT • link 13.0 years ago Valerie Obenchain ★ 6.8k

Login before adding your answer.