multiple hits with countOverlaps function

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 3 months ago

Australia/Melbourne

Hi, I am using countOverlaps function in GenomicFeatures package to count how many sequencing reads are mapped to each genomic feature (e.g. promoter region, gene region etc.). However I found that if a read was mapped to more than one features, then the count of each feature will be increased by 1, i.e. this read was counted more than one times although it can only originate from one genomic location. Below is a simple example to show this: ------------------------- features <- GRanges(seqnames="chr1", ranges=IRanges(c(100,500),c(1000,1500)) ,strand=c("+","+")) features GRanges with 2 ranges and 0 elementMetadata values seqnames ranges strand | <rle> <iranges> <rle> | [1] chr1 [100, 1000] + | [2] chr1 [500, 1500] + | seqlengths chr1 NA reads <- GRanges(seqnames="chr1",ranges=IRanges(c(150,600),c(250,700)) ,strand=c("+","+")) GRanges with 2 ranges and 0 elementMetadata values seqnames ranges strand | <rle> <iranges> <rle> | [1] chr1 [150, 250] + | [2] chr1 [600, 700] + | seqlengths chr1 NA countOverlaps(features,reads) [1] 2 1 -------------------- I think It will be a better to have an argument in countOverlaps function which specifies whether a single hit (eg. a random one) or multiple hits should be returned for each query sequence. Below is my session info: sessionInfo() R version 2.12.0 (2010-10-15) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.2.3 GenomicRanges_1.2.3 IRanges_1.8.9 loaded via a namespace (and not attached): [1] Biobase_2.10.0 biomaRt_2.5.1 Biostrings_2.18.2 BSgenome_1.18.3 DBI_0.2-5 RCurl_1.4-3 [7] RSQLite_0.9-2 rtracklayer_1.10.6 tools_2.12.0 XML_3.2-0 Cheers, Wei ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}

Sequencing GenomicFeatures Sequencing GenomicFeatures • 3.6k views

ADD COMMENT • link 14.6 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 3 months ago

Australia/Melbourne

On Apr 14, 2011, at 10:32 AM, Kasper Daniel Hansen wrote: > On Wed, Apr 13, 2011 at 7:43 PM, Wei Shi <shi at="" wehi.edu.au=""> wrote: >> Hi, >> >> I am using countOverlaps function in GenomicFeatures package to count how many sequencing reads are mapped to each genomic feature (e.g. promoter region, gene region etc.). However I found that if a read was mapped to more than one features, then the count of each feature will be increased by 1, i.e. this read was counted more than one times although it can only originate from one genomic location. Below is a simple example to show this: >> >> ------------------------- >> features <- GRanges(seqnames="chr1", ranges=IRanges(c(100,500),c(1000,1500)) ,strand=c("+","+")) >> features >> GRanges with 2 ranges and 0 elementMetadata values >> seqnames ranges strand | >> <rle> <iranges> <rle> | >> [1] chr1 [100, 1000] + | >> [2] chr1 [500, 1500] + | >> >> seqlengths >> chr1 >> NA >> reads <- GRanges(seqnames="chr1",ranges=IRanges(c(150,600),c(250,70 0)),strand=c("+","+")) >> GRanges with 2 ranges and 0 elementMetadata values >> seqnames ranges strand | >> <rle> <iranges> <rle> | >> [1] chr1 [150, 250] + | >> [2] chr1 [600, 700] + | >> >> seqlengths >> chr1 >> NA >> countOverlaps(features,reads) >> [1] 2 1 >> -------------------- >> >> I think It will be a better to have an argument in countOverlaps function which specifies whether a single hit (eg. a random one) or multiple hits should be returned for each query sequence. > > I don't see the point at all. In your case [100,1000] clearly > overlaps both [150,250] and [600,700]. countOverlaps should return 2. > If you insist on reporting a random hit, countOverlaps would always > return 0 or 1 and never a bigger number. What is the point in this? > > What is true for countOverlaps(query, subject) is that it is not > symmetric wrt. query and subject and this is typically essential to > understand. > > Kasper The point is that the second read ([600, 700]) overlaps with both features and it was counted by both features. So the first feature ([100, 1000]) counts both reads but the second feature ([500, 1500] ) counts the second read again. Therefore, the second read was counted twice. In other words, there are only two reads in this example, but the total number of counts output from countOverlaps is three. Hope this clarifies. Wei ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}

ADD COMMENT • link 14.6 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Is it possible to implement an option that will do weighted counts, i.e. if a given sequence maps to two locations then each is counted as 0.5, similarly if it maps to three locations, each mapping is counted as 1/3? -----Original Message----- From: bioconductor-bounces@r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Wei Shi Sent: Wednesday, April 13, 2011 5:47 PM To: Kasper Daniel Hansen Cc: Bioconductor Subject: Re: [BioC] multiple hits with countOverlaps function On Apr 14, 2011, at 10:32 AM, Kasper Daniel Hansen wrote: > On Wed, Apr 13, 2011 at 7:43 PM, Wei Shi <shi at="" wehi.edu.au=""> wrote: >> Hi, >> >> I am using countOverlaps function in GenomicFeatures package to count how many sequencing reads are mapped to each genomic feature (e.g. promoter region, gene region etc.). However I found that if a read was mapped to more than one features, then the count of each feature will be increased by 1, i.e. this read was counted more than one times although it can only originate from one genomic location. Below is a simple example to show this: >> >> ------------------------- >> features <- GRanges(seqnames="chr1", ranges=IRanges(c(100,500),c(1000,1500)) ,strand=c("+","+")) >> features >> GRanges with 2 ranges and 0 elementMetadata values >> seqnames ranges strand | >> <rle> <iranges> <rle> | >> [1] chr1 [100, 1000] + | >> [2] chr1 [500, 1500] + | >> >> seqlengths >> chr1 >> NA >> reads <- GRanges(seqnames="chr1",ranges=IRanges(c(150,600),c(250,700)),strand=c ("+"," +")) >> GRanges with 2 ranges and 0 elementMetadata values >> seqnames ranges strand | >> <rle> <iranges> <rle> | >> [1] chr1 [150, 250] + | >> [2] chr1 [600, 700] + | >> >> seqlengths >> chr1 >> NA >> countOverlaps(features,reads) >> [1] 2 1 >> -------------------- >> >> I think It will be a better to have an argument in countOverlaps function which specifies whether a single hit (eg. a random one) or multiple hits should be returned for each query sequence. > > I don't see the point at all. In your case [100,1000] clearly > overlaps both [150,250] and [600,700]. countOverlaps should return 2. > If you insist on reporting a random hit, countOverlaps would always > return 0 or 1 and never a bigger number. What is the point in this? > > What is true for countOverlaps(query, subject) is that it is not > symmetric wrt. query and subject and this is typically essential to > understand. > > Kasper The point is that the second read ([600, 700]) overlaps with both features and it was counted by both features. So the first feature ([100, 1000]) counts both reads but the second feature ([500, 1500] ) counts the second read again. Therefore, the second read was counted twice. In other words, there are only two reads in this example, but the total number of counts output from countOverlaps is three. Hope this clarifies. Wei ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:19}}

ADD REPLY • link 14.6 years ago Mete Civelek ▴ 180

0

Entering edit mode

On Wed, Apr 13, 2011 at 8:46 PM, Wei Shi <shi at="" wehi.edu.au=""> wrote: > The point is that the second read ([600, 700]) overlaps with both features and it was counted by both features. So the first feature ([100, 1000]) counts both reads but the second feature ([500, 1500] ) counts the second read again. Therefore, the second read was counted twice. In other words, there are only two reads in this example, but the total number of counts output from countOverlaps is three. Yes, and I think this is entirely to be expected. In all my use-cases, this is exactly what I want. I dont get the "the second read was counted twice. " It is the nature of the problem that reads have length > 1 and they can overlap multiple features and you need to thing about how you want to deal with this. I assume you are looking at HTseq data, and I cannot really understand what you are trying to do. Kasper

ADD REPLY • link 14.6 years ago Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

On Apr 14, 2011, at 11:00 AM, Kasper Daniel Hansen wrote: > On Wed, Apr 13, 2011 at 8:46 PM, Wei Shi <shi at="" wehi.edu.au=""> wrote: >> The point is that the second read ([600, 700]) overlaps with both features and it was counted by both features. So the first feature ([100, 1000]) counts both reads but the second feature ([500, 1500] ) counts the second read again. Therefore, the second read was counted twice. In other words, there are only two reads in this example, but the total number of counts output from countOverlaps is three. > > Yes, and I think this is entirely to be expected. In all my > use-cases, this is exactly what I want. > > I dont get the "the second read was counted twice. " It is the nature > of the problem that reads have length > 1 and they can overlap > multiple features and you need to thing about how you want to deal > with this. I assume you are looking at HTseq data, and I cannot > really understand what you are trying to do. > > Kasper Let's take RNA-seq data for an example. It is known that many genes overlap with each other in the genome. If a read is mapped to a location which is shared by two genes (by two exons from the two genes respectively to be exact), then this read should not be assigned to both genes because it can only originate from one of the genes/transcripts. However this might not be a problem for other types of sequencing data such as histone ChIP-seq because the marks could affect all the overlapping features. However, it will be useful if there is an option in countOverlaps which allows users decide how to deal with this overlapping feature issue. Wei ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}

ADD REPLY • link 14.6 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Hi, On Wed, Apr 13, 2011 at 9:20 PM, Wei Shi <shi at="" wehi.edu.au=""> wrote: [snip] >> I dont get the "the second read was counted twice. " ?It is the nature >> of the problem that reads have length > 1 and they can overlap >> multiple features and you need to thing about how you want to deal >> with this. ?I assume you are looking at HTseq data, and I cannot >> really understand what you are trying to do. >> >> Kasper > > Let's take RNA-seq data for an example. It is known that many genes overlap with each other in the genome. If a read is mapped to a location which is shared by two genes (by two exons from the two genes respectively to be exact), then this read should not be assigned to both genes because it can only originate from one of the genes/transcripts. Yes, it can only originate from one, but how would countOverlaps know which one? How does anyone know which one? As far as I know, there is no standard way to deal with this, so the burden is on you. If you have an idea of "what's right" in this situation, why don't you rather ask the opposite question first, which is: "Which reads overlap more than one feature?" R> features.per.read <- countOverlaps(reads, features) Then do your "naive" counting/quantifying w/ just the "easy" case: R> reads.per.feature <- countOverlaps(features, features.per.read[features.per.read == 1]) Now you can figure out what you want to do for features.per.read > 1, because picking one feature to assign the read to at random by default is clearly wrong, right? What if the gene/isoform that the read overlaps with isn't expressed at all, you'd be assigning reads to it "just because" in your scenario. -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 14.6 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi: No one knows which features is the right one. But my view is that assigning a read to a random feature is better than assigning it to all the overlapping ones. Your code can certainly find a random feature for the read, but adding an option to the function might be better. On Apr 14, 2011, at 11:56 AM, Steve Lianoglou wrote: > Hi, > > On Wed, Apr 13, 2011 at 9:20 PM, Wei Shi <shi at="" wehi.edu.au=""> wrote: > [snip] >>> I dont get the "the second read was counted twice. " It is the nature >>> of the problem that reads have length > 1 and they can overlap >>> multiple features and you need to thing about how you want to deal >>> with this. I assume you are looking at HTseq data, and I cannot >>> really understand what you are trying to do. >>> >>> Kasper >> >> Let's take RNA-seq data for an example. It is known that many genes overlap with each other in the genome. If a read is mapped to a location which is shared by two genes (by two exons from the two genes respectively to be exact), then this read should not be assigned to both genes because it can only originate from one of the genes/transcripts. > > Yes, it can only originate from one, but how would countOverlaps know which one? > > How does anyone know which one? > > As far as I know, there is no standard way to deal with this, so the > burden is on you. > > If you have an idea of "what's right" in this situation, why don't you > rather ask the opposite question first, which is: "Which reads overlap > more than one feature?" > > R> features.per.read <- countOverlaps(reads, features) > > Then do your "naive" counting/quantifying w/ just the "easy" case: > > R> reads.per.feature <- countOverlaps(features, > features.per.read[features.per.read == 1]) > > Now you can figure out what you want to do for features.per.read > 1, > because picking one feature to assign the read to at random by default > is clearly wrong, right? > > What if the gene/isoform that the read overlaps with isn't expressed > at all, you'd be assigning reads to it "just because" in your > scenario. > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}

ADD REPLY • link 14.6 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Also I want to mention that choosing a random feature will eventually make all overlapping features have the the same number of such reads too if the total number of such reads is large enough. This should give a less biased read count summarization compared to assigning them to all features (which have the same numbers of such reads too) because total number of reads assigned to features will be the same with the original total when only random features are chosen. On Apr 14, 2011, at 11:56 AM, Steve Lianoglou wrote: > Hi, > > On Wed, Apr 13, 2011 at 9:20 PM, Wei Shi <shi at="" wehi.edu.au=""> wrote: > [snip] >>> I dont get the "the second read was counted twice. " It is the nature >>> of the problem that reads have length > 1 and they can overlap >>> multiple features and you need to thing about how you want to deal >>> with this. I assume you are looking at HTseq data, and I cannot >>> really understand what you are trying to do. >>> >>> Kasper >> >> Let's take RNA-seq data for an example. It is known that many genes overlap with each other in the genome. If a read is mapped to a location which is shared by two genes (by two exons from the two genes respectively to be exact), then this read should not be assigned to both genes because it can only originate from one of the genes/transcripts. > > Yes, it can only originate from one, but how would countOverlaps know which one? > > How does anyone know which one? > > As far as I know, there is no standard way to deal with this, so the > burden is on you. > > If you have an idea of "what's right" in this situation, why don't you > rather ask the opposite question first, which is: "Which reads overlap > more than one feature?" > > R> features.per.read <- countOverlaps(reads, features) > > Then do your "naive" counting/quantifying w/ just the "easy" case: > > R> reads.per.feature <- countOverlaps(features, > features.per.read[features.per.read == 1]) > > Now you can figure out what you want to do for features.per.read > 1, > because picking one feature to assign the read to at random by default > is clearly wrong, right? > > What if the gene/isoform that the read overlaps with isn't expressed > at all, you'd be assigning reads to it "just because" in your > scenario. > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}

ADD REPLY • link 14.6 years ago Wei Shi ★ 3.6k

0

Entering edit mode

On Wed, Apr 13, 2011 at 7:33 PM, Wei Shi <shi@wehi.edu.au> wrote: > Also I want to mention that choosing a random feature will eventually make > all overlapping features have the the same number of such reads too if the > total number of such reads is large enough. This should give a less biased > read count summarization compared to assigning them to all features (which > have the same numbers of such reads too) because total number of reads > assigned to features will be the same with the original total when only > random features are chosen. > > Sorry, but randomly choosing overlaps will hamper reproducibility. If you want to count overlaps your own way, why not just use findOverlaps and implement your own counting scheme. Michael > On Apr 14, 2011, at 11:56 AM, Steve Lianoglou wrote: > > > Hi, > > > > On Wed, Apr 13, 2011 at 9:20 PM, Wei Shi <shi@wehi.edu.au> wrote: > > [snip] > >>> I dont get the "the second read was counted twice. " It is the nature > >>> of the problem that reads have length > 1 and they can overlap > >>> multiple features and you need to thing about how you want to deal > >>> with this. I assume you are looking at HTseq data, and I cannot > >>> really understand what you are trying to do. > >>> > >>> Kasper > >> > >> Let's take RNA-seq data for an example. It is known that many genes > overlap with each other in the genome. If a read is mapped to a location > which is shared by two genes (by two exons from the two genes respectively > to be exact), then this read should not be assigned to both genes because it > can only originate from one of the genes/transcripts. > > > > Yes, it can only originate from one, but how would countOverlaps know > which one? > > > > How does anyone know which one? > > > > As far as I know, there is no standard way to deal with this, so the > > burden is on you. > > > > If you have an idea of "what's right" in this situation, why don't you > > rather ask the opposite question first, which is: "Which reads overlap > > more than one feature?" > > > > R> features.per.read <- countOverlaps(reads, features) > > > > Then do your "naive" counting/quantifying w/ just the "easy" case: > > > > R> reads.per.feature <- countOverlaps(features, > > features.per.read[features.per.read == 1]) > > > > Now you can figure out what you want to do for features.per.read > 1, > > because picking one feature to assign the read to at random by default > > is clearly wrong, right? > > > > What if the gene/isoform that the read overlaps with isn't expressed > > at all, you'd be assigning reads to it "just because" in your > > scenario. > > > > -- > > Steve Lianoglou > > Graduate Student: Computational Systems Biology > > | Memorial Sloan-Kettering Cancer Center > > | Weill Medical College of Cornell University > > Contact Info: http://cbio.mskcc.org/~lianos/contact > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:13}}

ADD REPLY • link 14.6 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi all, We've recently added a function to the GenomicRanges package called countGenomicOverlaps. It attempts to address this problem of multi-hit reads by providing the user with options for dividing the read hit. There are only a couple of resolution options at present but you get the idea of where it is going. As Michael mentions, one can always count overlaps with their own implementation. The goal with the countGenomicOverlaps is to provide options to those who may not want to write their own. If you have the time to check it out I'd appreciate any feedback on the usefulness of this type of function in R, relevance of the current options etc. Valerie On 04/13/2011 09:48 PM, Michael Lawrence wrote: > On Wed, Apr 13, 2011 at 7:33 PM, Wei Shi<shi@wehi.edu.au> wrote: > >> Also I want to mention that choosing a random feature will eventually make >> all overlapping features have the the same number of such reads too if the >> total number of such reads is large enough. This should give a less biased >> read count summarization compared to assigning them to all features (which >> have the same numbers of such reads too) because total number of reads >> assigned to features will be the same with the original total when only >> random features are chosen. >> >> > Sorry, but randomly choosing overlaps will hamper reproducibility. If you > want to count overlaps your own way, why not just use findOverlaps and > implement your own counting scheme. > > Michael > > > >> On Apr 14, 2011, at 11:56 AM, Steve Lianoglou wrote: >> >>> Hi, >>> >>> On Wed, Apr 13, 2011 at 9:20 PM, Wei Shi<shi@wehi.edu.au> wrote: >>> [snip] >>>>> I dont get the "the second read was counted twice. " It is the nature >>>>> of the problem that reads have length> 1 and they can overlap >>>>> multiple features and you need to thing about how you want to deal >>>>> with this. I assume you are looking at HTseq data, and I cannot >>>>> really understand what you are trying to do. >>>>> >>>>> Kasper >>>> Let's take RNA-seq data for an example. It is known that many genes >> overlap with each other in the genome. If a read is mapped to a location >> which is shared by two genes (by two exons from the two genes respectively >> to be exact), then this read should not be assigned to both genes because it >> can only originate from one of the genes/transcripts. >>> Yes, it can only originate from one, but how would countOverlaps know >> which one? >>> How does anyone know which one? >>> >>> As far as I know, there is no standard way to deal with this, so the >>> burden is on you. >>> >>> If you have an idea of "what's right" in this situation, why don't you >>> rather ask the opposite question first, which is: "Which reads overlap >>> more than one feature?" >>> >>> R> features.per.read<- countOverlaps(reads, features) >>> >>> Then do your "naive" counting/quantifying w/ just the "easy" case: >>> >>> R> reads.per.feature<- countOverlaps(features, >>> features.per.read[features.per.read == 1]) >>> >>> Now you can figure out what you want to do for features.per.read> 1, >>> because picking one feature to assign the read to at random by default >>> is clearly wrong, right? >>> >>> What if the gene/isoform that the read overlaps with isn't expressed >>> at all, you'd be assigning reads to it "just because" in your >>> scenario. >>> >>> -- >>> Steve Lianoglou >>> Graduate Student: Computational Systems Biology >>> | Memorial Sloan-Kettering Cancer Center >>> | Weill Medical College of Cornell University >>> Contact Info: http://cbio.mskcc.org/~lianos/contact >> >> ______________________________________________________________________ >> The information in this email is confidential and inte...{{dropped:13}} > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 14.6 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

On Thu, Apr 14, 2011 at 9:55 AM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > Hi all, > > We've recently added a function to the GenomicRanges package called > countGenomicOverlaps. It attempts to address this problem of multi- hit > reads by providing the user with options for dividing the read hit. > There are only a couple of resolution options at present but you get the > idea of where it is going. > > This is cool, but could this be generalized? I don't see what's so "genomic" about weighting cases where multiple queries hit the same subject. This could land in IRanges. > As Michael mentions, one can always count overlaps with their own > implementation. The goal with the countGenomicOverlaps is to provide > options to those who may not want to write their own. If you have the > time to check it out I'd appreciate any feedback on the usefulness of > this type of function in R, relevance of the current options etc. > > Valerie > > > > On 04/13/2011 09:48 PM, Michael Lawrence wrote: > > On Wed, Apr 13, 2011 at 7:33 PM, Wei Shi<shi@wehi.edu.au> wrote: > > > >> Also I want to mention that choosing a random feature will eventually > make > >> all overlapping features have the the same number of such reads too if > the > >> total number of such reads is large enough. This should give a less > biased > >> read count summarization compared to assigning them to all features > (which > >> have the same numbers of such reads too) because total number of reads > >> assigned to features will be the same with the original total when only > >> random features are chosen. > >> > >> > > Sorry, but randomly choosing overlaps will hamper reproducibility. If you > > want to count overlaps your own way, why not just use findOverlaps and > > implement your own counting scheme. > > > > Michael > > > > > > > >> On Apr 14, 2011, at 11:56 AM, Steve Lianoglou wrote: > >> > >>> Hi, > >>> > >>> On Wed, Apr 13, 2011 at 9:20 PM, Wei Shi<shi@wehi.edu.au> wrote: > >>> [snip] > >>>>> I dont get the "the second read was counted twice. " It is the > nature > >>>>> of the problem that reads have length> 1 and they can overlap > >>>>> multiple features and you need to thing about how you want to deal > >>>>> with this. I assume you are looking at HTseq data, and I cannot > >>>>> really understand what you are trying to do. > >>>>> > >>>>> Kasper > >>>> Let's take RNA-seq data for an example. It is known that many genes > >> overlap with each other in the genome. If a read is mapped to a location > >> which is shared by two genes (by two exons from the two genes > respectively > >> to be exact), then this read should not be assigned to both genes > because it > >> can only originate from one of the genes/transcripts. > >>> Yes, it can only originate from one, but how would countOverlaps know > >> which one? > >>> How does anyone know which one? > >>> > >>> As far as I know, there is no standard way to deal with this, so the > >>> burden is on you. > >>> > >>> If you have an idea of "what's right" in this situation, why don't you > >>> rather ask the opposite question first, which is: "Which reads overlap > >>> more than one feature?" > >>> > >>> R> features.per.read<- countOverlaps(reads, features) > >>> > >>> Then do your "naive" counting/quantifying w/ just the "easy" case: > >>> > >>> R> reads.per.feature<- countOverlaps(features, > >>> features.per.read[features.per.read == 1]) > >>> > >>> Now you can figure out what you want to do for features.per.read> 1, > >>> because picking one feature to assign the read to at random by default > >>> is clearly wrong, right? > >>> > >>> What if the gene/isoform that the read overlaps with isn't expressed > >>> at all, you'd be assigning reads to it "just because" in your > >>> scenario. > >>> > >>> -- > >>> Steve Lianoglou > >>> Graduate Student: Computational Systems Biology > >>> | Memorial Sloan-Kettering Cancer Center > >>> | Weill Medical College of Cornell University > >>> Contact Info: http://cbio.mskcc.org/~lianos/contact > >> > >> ______________________________________________________________________ > >> The information in this email is confidential and inte...{{dropped:13}} > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 14.6 years ago Michael Lawrence ★ 11k

0

Entering edit mode

An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20110415="" 309f9be5="" attachment-0001.pl="">

ADD REPLY • link 14.6 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

On 04/14/2011 03:51 PM, Michael Lawrence wrote: > > > On Thu, Apr 14, 2011 at 9:55 AM, Valerie Obenchain <vobencha@fhcrc.org> <mailto:vobencha@fhcrc.org>> wrote: > > Hi all, > > We've recently added a function to the GenomicRanges package called > countGenomicOverlaps. It attempts to address this problem of multi-hit > reads by providing the user with options for dividing the read hit. > There are only a couple of resolution options at present but you > get the > idea of where it is going. > > > This is cool, but could this be generalized? I don't see what's so > "genomic" about weighting cases where multiple queries hit the same > subject. This could land in IRanges. We are open to input re: the final name/location of the function. We went with "genomic" because it was developed with the exon/transcript/gene idea in mind. If the consensus is that IRanges is more appropriate then maybe it should live there. Valerie > > As Michael mentions, one can always count overlaps with their own > implementation. The goal with the countGenomicOverlaps is to provide > options to those who may not want to write their own. If you have the > time to check it out I'd appreciate any feedback on the usefulness of > this type of function in R, relevance of the current options etc. > > Valerie > > > > On 04/13/2011 09:48 PM, Michael Lawrence wrote: > > On Wed, Apr 13, 2011 at 7:33 PM, Wei Shi<shi@wehi.edu.au> <mailto:shi@wehi.edu.au>> wrote: > > > >> Also I want to mention that choosing a random feature will > eventually make > >> all overlapping features have the the same number of such reads > too if the > >> total number of such reads is large enough. This should give a > less biased > >> read count summarization compared to assigning them to all > features (which > >> have the same numbers of such reads too) because total number > of reads > >> assigned to features will be the same with the original total > when only > >> random features are chosen. > >> > >> > > Sorry, but randomly choosing overlaps will hamper > reproducibility. If you > > want to count overlaps your own way, why not just use > findOverlaps and > > implement your own counting scheme. > > > > Michael > > > > > > > >> On Apr 14, 2011, at 11:56 AM, Steve Lianoglou wrote: > >> > >>> Hi, > >>> > >>> On Wed, Apr 13, 2011 at 9:20 PM, Wei Shi<shi@wehi.edu.au> <mailto:shi@wehi.edu.au>> wrote: > >>> [snip] > >>>>> I dont get the "the second read was counted twice. " It is > the nature > >>>>> of the problem that reads have length> 1 and they can overlap > >>>>> multiple features and you need to thing about how you want > to deal > >>>>> with this. I assume you are looking at HTseq data, and I cannot > >>>>> really understand what you are trying to do. > >>>>> > >>>>> Kasper > >>>> Let's take RNA-seq data for an example. It is known that many > genes > >> overlap with each other in the genome. If a read is mapped to a > location > >> which is shared by two genes (by two exons from the two genes > respectively > >> to be exact), then this read should not be assigned to both > genes because it > >> can only originate from one of the genes/transcripts. > >>> Yes, it can only originate from one, but how would > countOverlaps know > >> which one? > >>> How does anyone know which one? > >>> > >>> As far as I know, there is no standard way to deal with this, > so the > >>> burden is on you. > >>> > >>> If you have an idea of "what's right" in this situation, why > don't you > >>> rather ask the opposite question first, which is: "Which reads > overlap > >>> more than one feature?" > >>> > >>> R> features.per.read<- countOverlaps(reads, features) > >>> > >>> Then do your "naive" counting/quantifying w/ just the "easy" > case: > >>> > >>> R> reads.per.feature<- countOverlaps(features, > >>> features.per.read[features.per.read == 1]) > >>> > >>> Now you can figure out what you want to do for > features.per.read> 1, > >>> because picking one feature to assign the read to at random by > default > >>> is clearly wrong, right? > >>> > >>> What if the gene/isoform that the read overlaps with isn't > expressed > >>> at all, you'd be assigning reads to it "just because" in your > >>> scenario. > >>> > >>> -- > >>> Steve Lianoglou > >>> Graduate Student: Computational Systems Biology > >>> | Memorial Sloan-Kettering Cancer Center > >>> | Weill Medical College of Cornell University > >>> Contact Info: http://cbio.mskcc.org/~lianos/contact > <http: cbio.mskcc.org="" %7elianos="" contact=""> > >> > >> > ______________________________________________________________________ > >> The information in this email is confidential and > inte...{{dropped:13}} > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org <mailto:bioconductor@r-project.org> > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org <mailto:bioconductor@r-project.org> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > [[alternative HTML version deleted]]

ADD REPLY • link 14.6 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 3 months ago

Australia/Melbourne

On Apr 14, 2011, at 11:34 AM, Kasper Daniel Hansen wrote: > On Wed, Apr 13, 2011 at 9:19 PM, Wei Shi <shi at="" wehi.edu.au=""> wrote: >> Let's take RNA-seq data for an example. It is known that many genes overlap with each other in the genome. If a read is mapped to a location which is shared by two genes (by two exons from the two genes respectively to be exact), then this read should not be assigned to both genes because it can only originate from one of the genes/transcripts. > > Why? I mean, I see what you mean from an assay point of view. But > clearly it is highly non-trivial to decide which of the two genes you > want to assign the read to. People have proposed methods for this but > it is certainly not resolved. I would argue that the first step is to > divide the 2 genes into 3 genomic regions, gene1 only, gene2 only and > both gene1 and gene2, count the overlap and then proceed from there. > > Doing a (uniform) random assignment is clearly wrong (for this > particular use case). > > Finally, this cannot be the situation your example is referring to, > since you did not have overlapping features. You were instead > implying that if the first read hit feature A, the second read should > not be allowed to hit feature A. (but that is perhaps not too > important). > > I might be argumentative, but you seem to imply that this > "unresolveness" should be handled by a very simple option to > countOverlaps and I would argue that this is not a solution. Indeed I > would argue that users who do not wish to think hard about this are > ignoring a possibly important (and possibly unimportant) issue with > the data. > > Kasper Please bring your posts to the mailing list, Kasper. The two features in my email are overlapping. I re-paste the features below. >> features <- GRanges(seqnames="chr1", ranges=IRanges(c(100,500),c(1000,1500)) ,strand=c("+","+")) >> features >> GRanges with 2 ranges and 0 elementMetadata values >> seqnames ranges strand | >> <rle> <iranges> <rle> | >> [1] chr1 [100, 1000] + | >> [2] chr1 [500, 1500] + | > I did not make any implication on which read should hit which feature. The purpose of that example was just to show that the read which hits multiple features will be counted by each of them. Also, I did not imply choosing a random feature will solve the multi- feature hit problem and I do not think it will solve this problem. However, I think assigning the read to a random feature is better than assigning it to all of them. Wei ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}

ADD COMMENT • link 14.6 years ago Wei Shi ★ 3.6k

Login before adding your answer.