Question: find overlap of bed files of different length
0
gravatar for Duke
8.8 years ago by
Duke210
Duke210 wrote:
On 1/31/11 1:20 PM, Kasper Daniel Hansen wrote: > Use findOverlaps to find all cases. This is usually the hard and big > computation. Then use for example pintersect() to compute the actual > overlap in percent. There might be some tedious coding involved. Thanks for your suggestion Kasper, though honestly I have not tried it yet. But based on what Martin and you suggested, I thought the final code will not run fast because of extracting to strand/subset and running each. Especially my task is a little more complicated: I need to find gene expressions (counting sequences in exonic regions of each gene). I also gave BEDTools a try, but it does not fulfil my needs (extremely slow for a gene list of 28k). I ended up with coding a c++ code to do the job. Thanks for all of your suggestions and helps guys. D.
• 866 views
ADD COMMENTlink modified 8.8 years ago by Michael Lawrence11k • written 8.8 years ago by Duke210
Answer: find overlap of bed files of different length
0
gravatar for Michael Lawrence
8.8 years ago by
United States
Michael Lawrence11k wrote:
On Tue, Feb 1, 2011 at 7:06 AM, Duke <duke.lists@gmx.com> wrote: > On 1/31/11 1:20 PM, Kasper Daniel Hansen wrote: > >> Use findOverlaps to find all cases. This is usually the hard and big >> computation. Then use for example pintersect() to compute the actual >> overlap in percent. There might be some tedious coding involved. >> > > Thanks for your suggestion Kasper, though honestly I have not tried it yet. > But based on what Martin and you suggested, I thought the final code will > not run fast because of extracting to strand/subset and running each. > Especially my task is a little more complicated: I need to find gene > expressions (counting sequences in exonic regions of each gene). I also gave > BEDTools a try, but it does not fulfil my needs (extremely slow for a gene > list of 28k). > > I ended up with coding a c++ code to do the job. Thanks for all of your > suggestions and helps guys. > > It would be nice to have a little more detail about what you needed. If findOverlaps and friends aren't doing the job, it would be good to know. Counting reads in exons of genes is as simple as calling countOverlaps on the GRangesList of the exons. > > D. > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENTlink written 8.8 years ago by Michael Lawrence11k
On 2/1/11 1:11 PM, Michael Lawrence wrote: > > > On Tue, Feb 1, 2011 at 7:06 AM, Duke <duke.lists@gmx.com> <mailto:duke.lists@gmx.com>> wrote: > > On 1/31/11 1:20 PM, Kasper Daniel Hansen wrote: > > Use findOverlaps to find all cases. This is usually the hard > and big > computation. Then use for example pintersect() to compute the > actual > overlap in percent. There might be some tedious coding involved. > > > Thanks for your suggestion Kasper, though honestly I have not > tried it yet. But based on what Martin and you suggested, I > thought the final code will not run fast because of extracting to > strand/subset and running each. Especially my task is a little > more complicated: I need to find gene expressions (counting > sequences in exonic regions of each gene). I also gave BEDTools a > try, but it does not fulfil my needs (extremely slow for a gene > list of 28k). > > I ended up with coding a c++ code to do the job. Thanks for all of > your suggestions and helps guys. > > > It would be nice to have a little more detail about what you needed. > If findOverlaps and friends aren't doing the job, it would be good to > know. Counting reads in exons of genes is as simple as calling > countOverlaps on the GRangesList of the exons. Hi Micheal, My task is to count the reads of a bed file of different length in exons of genes with a controllable overlap option (by percentage, not by bases). Some people want to count it with overlap=100% length of reads, but some other might want to count it with 20% for example. This option should be very similar to minOverlap, but in percentage instead of bases. D. [[alternative HTML version deleted]]
ADD REPLYlink written 8.8 years ago by Duke210
On Tue, Feb 1, 2011 at 10:35 AM, Duke <duke.lists@gmx.com> wrote: > On 2/1/11 1:11 PM, Michael Lawrence wrote: > > > > On Tue, Feb 1, 2011 at 7:06 AM, Duke <duke.lists@gmx.com> wrote: > >> On 1/31/11 1:20 PM, Kasper Daniel Hansen wrote: >> >>> Use findOverlaps to find all cases. This is usually the hard and big >>> computation. Then use for example pintersect() to compute the actual >>> overlap in percent. There might be some tedious coding involved. >>> >> >> Thanks for your suggestion Kasper, though honestly I have not tried it >> yet. But based on what Martin and you suggested, I thought the final code >> will not run fast because of extracting to strand/subset and running each. >> Especially my task is a little more complicated: I need to find gene >> expressions (counting sequences in exonic regions of each gene). I also gave >> BEDTools a try, but it does not fulfil my needs (extremely slow for a gene >> list of 28k). >> >> I ended up with coding a c++ code to do the job. Thanks for all of your >> suggestions and helps guys. >> >> > It would be nice to have a little more detail about what you needed. If > findOverlaps and friends aren't doing the job, it would be good to know. > Counting reads in exons of genes is as simple as calling countOverlaps on > the GRangesList of the exons. > > > Hi Micheal, > > My task is to count the reads of a bed file of different length in exons of > genes with a controllable overlap option (by percentage, not by bases). Some > people want to count it with overlap=100% length of reads, but some other > might want to count it with 20% for example. This option should be very > similar to minOverlap, but in percentage instead of bases. > > This is a reasonable request. As Kasper mentioned, it's possible with post processing. E.g.: m <- findOverlaps(query, subject) percentOverlap <- width(ranges(m, query, subject)) / width(query)[queryHits(m)] keep <- percentOverlap > cutoff Perhaps someone up North could add this to IRanges/GenomicRanges? Michael D. > [[alternative HTML version deleted]]
ADD REPLYlink written 8.8 years ago by Michael Lawrence11k
On 02/01/2011 11:08 AM, Michael Lawrence wrote: > On Tue, Feb 1, 2011 at 10:35 AM, Duke <duke.lists at="" gmx.com=""> wrote: > >> On 2/1/11 1:11 PM, Michael Lawrence wrote: >> >> >> >> On Tue, Feb 1, 2011 at 7:06 AM, Duke <duke.lists at="" gmx.com=""> wrote: >> >>> On 1/31/11 1:20 PM, Kasper Daniel Hansen wrote: >>> >>>> Use findOverlaps to find all cases. This is usually the hard and big >>>> computation. Then use for example pintersect() to compute the actual >>>> overlap in percent. There might be some tedious coding involved. >>>> >>> >>> Thanks for your suggestion Kasper, though honestly I have not tried it >>> yet. But based on what Martin and you suggested, I thought the final code >>> will not run fast because of extracting to strand/subset and running each. >>> Especially my task is a little more complicated: I need to find gene >>> expressions (counting sequences in exonic regions of each gene). I also gave >>> BEDTools a try, but it does not fulfil my needs (extremely slow for a gene >>> list of 28k). >>> >>> I ended up with coding a c++ code to do the job. Thanks for all of your >>> suggestions and helps guys. >>> >>> >> It would be nice to have a little more detail about what you needed. If >> findOverlaps and friends aren't doing the job, it would be good to know. >> Counting reads in exons of genes is as simple as calling countOverlaps on >> the GRangesList of the exons. >> >> >> Hi Micheal, >> >> My task is to count the reads of a bed file of different length in exons of >> genes with a controllable overlap option (by percentage, not by bases). Some >> people want to count it with overlap=100% length of reads, but some other >> might want to count it with 20% for example. This option should be very >> similar to minOverlap, but in percentage instead of bases. >> >> > This is a reasonable request. As Kasper mentioned, it's possible with post > processing. > > E.g.: > > m <- findOverlaps(query, subject) > percentOverlap <- width(ranges(m, query, subject)) / > width(query)[queryHits(m)] > keep <- percentOverlap > cutoff there are rough edges, e.g., (G)RangesList/(G)RangesList, but yes this will make it into GenomicRanges. Martin > > Perhaps someone up North could add this to IRanges/GenomicRanges? > > Michael > > D. >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793
ADD REPLYlink written 8.8 years ago by Martin Morgan ♦♦ 24k
Hi, On 02/01/2011 11:08 AM, Michael Lawrence wrote: [...] > This is a reasonable request. As Kasper mentioned, it's possible with post > processing. > > E.g.: > > m<- findOverlaps(query, subject) > percentOverlap<- width(ranges(m, query, subject)) / > width(query)[queryHits(m)] > keep<- percentOverlap> cutoff > > Perhaps someone up North could add this to IRanges/GenomicRanges? Would be my pleasure. Yes the functionality provided by this ranges,RangesMatching method is very useful but maybe we should try to give it a little bit more exposure. Right now it's kind of "hidden" in the man page for RangesMatching (which you will get with something like ?`RangesMatching-class` or ?`ranges,RangesMatching-method`) and not mentioned in any vignette AFAIK. Also it works only if query and subject are both Ranges objects but not with GRanges objects. So I propose to: 1. Rename this to something like overlaps(), or overlappingRanges(), or ... (fill the blank), and deprecate the ranges,RangesMatching method. overlaps() would be a new generic with the m,query,subject signature. The current name is probably part of the reason why nobody (except the original author of the method) knew about it. I personally find it unexpected that a quite non-specific name like "ranges" is used for this when it generally plays the role of an accessor to get/set the ranges of containers like IRanges, RangedData, GRanges, etc... 2. Add a "overlaps" method for RangesMatching,GenomicRanges,GenomicRanges. 3. Illustrate how to use this in the vignettes of GenomicRanges. Would that be OK? H. > > Michael > > D. >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLYlink written 8.8 years ago by Hervé Pagès ♦♦ 14k
2011/2/1 Hervé Pagès <hpages@fhcrc.org> > Hi, > > > On 02/01/2011 11:08 AM, Michael Lawrence wrote: > [...] > > This is a reasonable request. As Kasper mentioned, it's possible with post >> processing. >> >> E.g.: >> >> m<- findOverlaps(query, subject) >> percentOverlap<- width(ranges(m, query, subject)) / >> width(query)[queryHits(m)] >> keep<- percentOverlap> cutoff >> >> Perhaps someone up North could add this to IRanges/GenomicRanges? >> > > Would be my pleasure. Yes the functionality provided by this > ranges,RangesMatching method is very useful but maybe we should try > to give it a little bit more exposure. Right now it's kind of "hidden" > in the man page for RangesMatching (which you will get with something > like ?`RangesMatching-class` or ?`ranges,RangesMatching-method`) and > not mentioned in any vignette AFAIK. Also it works only if query and > subject are both Ranges objects but not with GRanges objects. > > So I propose to: > > 1. Rename this to something like overlaps(), or overlappingRanges(), > or ... (fill the blank), and deprecate the ranges,RangesMatching > method. > overlaps() would be a new generic with the m,query,subject > signature. > The current name is probably part of the reason why nobody (except > the original author of the method) knew about it. > I personally find it unexpected that a quite non-specific name > like "ranges" is used for this when it generally plays the role > of an accessor to get/set the ranges of containers like IRanges, > RangedData, GRanges, etc... > > 2. Add a "overlaps" method for > RangesMatching,GenomicRanges,GenomicRanges. > > 3. Illustrate how to use this in the vignettes of GenomicRanges. > > Would that be OK? > > That would be great, but I was really referring to supporting a fractional minOverlaps parameter to findOverlaps, etc. Sorry that I wasn't clear. > H. > > >> Michael >> >> D. >> >>> >>> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]
ADD REPLYlink written 8.8 years ago by Michael Lawrence11k
What about a more generic solution to this very reasonable utility request by returning a somewhat complete overlap mapping result in a GRanges object that contains overlap start/end positions, percent/absolute overlap, overlap types, etc. The relative overlap filtering by percent values can then be performed very easily in a second step, like in this example: ## Two sample GRanges objects library(GenomicRanges) grq <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), ranges = IRanges(seq(1, 100, by=10), end = seq(30, 120, by=10)), strand = Rle(strand(c("-", "+", "-")), c(1, 7, 2))) grs <- shift(grq[c(2,5,6)], 5) ## Return absolute and relative overlap positions with olRanges function source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Script s/rangeoverlapper.R") myol <- olRanges(query=grq, subject=grs, output="gr") # output="df" returns data frame myol ## Return overlaps that cover at least 50% of the query/subject ranges. myol[elementMetadata(myol)[, "OLpercQ"] > 50] myol[elementMetadata(myol)[, "OLpercS"] > 50] This works with both GRanges or IRanges objects as input. Wouldn't something like this be a nice "convenience output" argument for the findOverlap function??? Thomas On Tue, Feb 01, 2011 at 06:41:19PM -0800, Michael Lawrence wrote: > 2011/2/1 Hervé Pagès <hpages at="" fhcrc.org=""> > > > Hi, > > > > > > On 02/01/2011 11:08 AM, Michael Lawrence wrote: > > [...] > > > > This is a reasonable request. As Kasper mentioned, it's possible with post > >> processing. > >> > >> E.g.: > >> > >> m<- findOverlaps(query, subject) > >> percentOverlap<- width(ranges(m, query, subject)) / > >> width(query)[queryHits(m)] > >> keep<- percentOverlap> cutoff > >> > >> Perhaps someone up North could add this to IRanges/GenomicRanges? > >> > > > > Would be my pleasure. Yes the functionality provided by this > > ranges,RangesMatching method is very useful but maybe we should try > > to give it a little bit more exposure. Right now it's kind of "hidden" > > in the man page for RangesMatching (which you will get with something > > like ?`RangesMatching-class` or ?`ranges,RangesMatching-method`) and > > not mentioned in any vignette AFAIK. Also it works only if query and > > subject are both Ranges objects but not with GRanges objects. > > > > So I propose to: > > > > 1. Rename this to something like overlaps(), or overlappingRanges(), > > or ... (fill the blank), and deprecate the ranges,RangesMatching > > method. > > overlaps() would be a new generic with the m,query,subject > > signature. > > The current name is probably part of the reason why nobody (except > > the original author of the method) knew about it. > > I personally find it unexpected that a quite non-specific name > > like "ranges" is used for this when it generally plays the role > > of an accessor to get/set the ranges of containers like IRanges, > > RangedData, GRanges, etc... > > > > 2. Add a "overlaps" method for > > RangesMatching,GenomicRanges,GenomicRanges. > > > > 3. Illustrate how to use this in the vignettes of GenomicRanges. > > > > Would that be OK? > > > > > That would be great, but I was really referring to supporting a fractional > minOverlaps parameter to findOverlaps, etc. Sorry that I wasn't clear. > > > > H. > > > > > >> Michael > >> > >> D. > >> > >>> > >>> > >> [[alternative HTML version deleted]] > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > > > > > -- > > Hervé Pagès > > > > Program in Computational Biology > > Division of Public Health Sciences > > > > Fred Hutchinson Cancer Research Center > > 1100 Fairview Ave. N, M2-B876 > > P.O. Box 19024 > > Seattle, WA 98109-1024 > > > > E-mail: hpages at fhcrc.org > > Phone: (206) 667-5791 > > Fax: (206) 667-1319 > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLYlink written 8.8 years ago by Thomas Girke1.7k
Hi, On Tue, Feb 8, 2011 at 11:59 AM, Thomas Girke <thomas.girke at="" ucr.edu=""> wrote: > What about a more generic solution to this very reasonable utility > request by returning a somewhat complete overlap mapping result in a > GRanges object that contains overlap start/end positions, > percent/absolute overlap, overlap types, etc. The relative overlap > filtering by percent values can then be performed very easily in a > second step, like in this example: > > ## Two sample GRanges objects > library(GenomicRanges) > grq <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), > ? ? ? ? ? ? ? ranges = IRanges(seq(1, 100, by=10), end = seq(30, 120, by=10)), > ? ? ? ? ? ? ? strand = Rle(strand(c("-", "+", "-")), c(1, 7, 2))) > grs <- shift(grq[c(2,5,6)], 5) > > ## Return absolute and relative overlap positions with olRanges function > source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scri pts/rangeoverlapper.R") > myol <- olRanges(query=grq, subject=grs, output="gr") # output="df" returns data frame > myol > > ## Return overlaps that cover at least 50% of the query/subject ranges. > myol[elementMetadata(myol)[, "OLpercQ"] > 50] > myol[elementMetadata(myol)[, "OLpercS"] > 50] > > > This works with both GRanges or IRanges objects as input. Wouldn't something like this > be a nice "convenience output" argument for the findOverlap function??? After Michael's original input as to how to solve the problem, I added "my own" `quantifyOverlaps` function to my utility-belt which does much of what you suggest (really, it just returns an augmented matchMatrix). If the devs are opposed to adding more and more arguments to findOverlaps, perhaps a quanitfyOverlaps (or similar named function) that explicitly does this might be a more palatable alternative. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLYlink written 8.8 years ago by Steve Lianoglou12k
On Tue, Feb 8, 2011 at 9:05 AM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hi, > > On Tue, Feb 8, 2011 at 11:59 AM, Thomas Girke <thomas.girke@ucr.edu> > wrote: > > What about a more generic solution to this very reasonable utility > > request by returning a somewhat complete overlap mapping result in a > > GRanges object that contains overlap start/end positions, > > percent/absolute overlap, overlap types, etc. The relative overlap > > filtering by percent values can then be performed very easily in a > > second step, like in this example: > > > > ## Two sample GRanges objects > > library(GenomicRanges) > > grq <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, > 2, 4)), > > ranges = IRanges(seq(1, 100, by=10), end = seq(30, 120, > by=10)), > > strand = Rle(strand(c("-", "+", "-")), c(1, 7, 2))) > > grs <- shift(grq[c(2,5,6)], 5) > > > > ## Return absolute and relative overlap positions with olRanges function > > source(" > http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/rang eoverlapper.R > ") > > myol <- olRanges(query=grq, subject=grs, output="gr") # output="df" > returns data frame > > myol > > > > ## Return overlaps that cover at least 50% of the query/subject ranges. > > myol[elementMetadata(myol)[, "OLpercQ"] > 50] > > myol[elementMetadata(myol)[, "OLpercS"] > 50] > > > > > > This works with both GRanges or IRanges objects as input. Wouldn't > something like this > > be a nice "convenience output" argument for the findOverlap function??? > > After Michael's original input as to how to solve the problem, I added > "my own" `quantifyOverlaps` function to my utility-belt which does > much of what you suggest (really, it just returns an augmented > matchMatrix). > > If the devs are opposed to adding more and more arguments to > findOverlaps, perhaps a quanitfyOverlaps (or similar named function) > that explicitly does this might be a more palatable alternative. > > There are many possible operations that are essentially filtering/processing of the initial matching result. How much of this could be made easier by adding methods to RangesMatching(List)? Michael -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > [[alternative HTML version deleted]]
ADD REPLYlink written 8.8 years ago by Michael Lawrence11k
Dear IRanges / GenomicRanges / GenomicFeatures developers, first, thank you for these great packages, I use them a lot, because the are general and flexible enough, and yet very focused on genomic spaces. my question is in the title. I am glad to have a set of transcripts grouped by Entrez Gene IDs with the function transcriptBy(). However I would like to know what was the method and the rationale to create this association. The UCSC knownGene track, despite the name, is organized around transcripts rather than "genes". I cannot find this table on the UCSC genome website, neither I see documentation about it. I wold be glad to learn more about it. Best regards, Arnaud Amzallag Research Fellow Cancer center of the Massachusetts General Hospital Harvard Medical School
ADD REPLYlink written 8.8 years ago by Arnaud Amzallag100
There's a table that maps the UCSC ID's to Entrez Gene. Called something like knownToLocusLink. Michael On Fri, Feb 4, 2011 at 11:37 AM, Arnaud Amzallag <arnaud.amzallag@gmail.com>wrote: > Dear IRanges / GenomicRanges / GenomicFeatures developers, > > first, thank you for these great packages, I use them a lot, because the > are general and flexible enough, and yet very focused on genomic spaces. > > my question is in the title. I am glad to have a set of transcripts grouped > by Entrez Gene IDs with the function transcriptBy(). However I would like to > know what was the method and the rationale to create this association. The > UCSC knownGene track, despite the name, is organized around transcripts > rather than "genes". I cannot find this table on the UCSC genome website, > neither I see documentation about it. I wold be glad to learn more about it. > > Best regards, > > Arnaud Amzallag > Research Fellow > Cancer center of the Massachusetts General Hospital > Harvard Medical School > > [[alternative HTML version deleted]]
ADD REPLYlink written 8.8 years ago by Michael Lawrence11k
Thank you for your help Michael. I might ask them how they mapped it, I guess they had some coordinates for the Entrez genes and overlapped them with the transcripts. Arnaud On 4 févr. 11, at 18:11, Michael Lawrence wrote: > There's a table that maps the UCSC ID's to Entrez Gene. Called > something like knownToLocusLink. > > Michael > > On Fri, Feb 4, 2011 at 11:37 AM, Arnaud Amzallag <arnaud.amzallag@gmail.com> > wrote: > Dear IRanges / GenomicRanges / GenomicFeatures developers, > > first, thank you for these great packages, I use them a lot, because > the are general and flexible enough, and yet very focused on genomic > spaces. > > my question is in the title. I am glad to have a set of transcripts > grouped by Entrez Gene IDs with the function transcriptBy(). However > I would like to know what was the method and the rationale to create > this association. The UCSC knownGene track, despite the name, is > organized around transcripts rather than "genes". I cannot find this > table on the UCSC genome website, neither I see documentation about > it. I wold be glad to learn more about it. > > Best regards, > > Arnaud Amzallag > Research Fellow > Cancer center of the Massachusetts General Hospital > Harvard Medical School > > [[alternative HTML version deleted]]
ADD REPLYlink written 8.8 years ago by Arnaud Amzallag100
On Tue, Feb 1, 2011 at 2:08 PM, Michael Lawrence <lawrence.michael at="" gene.com=""> wrote: [snip] >> My task is to count the reads of a bed file of different length in exons of >> genes with a controllable overlap option (by percentage, not by bases). Some >> people want to count it with overlap=100% length of reads, but some other >> might want to count it with 20% for example. This option should be very >> similar to minOverlap, but in percentage instead of bases. >> >> > This is a reasonable request. As Kasper mentioned, it's possible with post > processing. > > E.g.: > > m <- findOverlaps(query, subject) > percentOverlap <- width(ranges(m, query, subject)) / > width(query)[queryHits(m)] > keep <- percentOverlap > cutoff > > Perhaps someone up North could add this to IRanges/GenomicRanges? [/snip] Hah! Very smooth! Now we just need the findOverlaps::select parameter to take "{max|min}Overlap" as a valid value and we can call it a day ... :-) -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLYlink written 8.8 years ago by Steve Lianoglou12k
Answer: find overlap of bed files of different length
0
gravatar for Kasper Daniel Hansen
8.8 years ago by
United States
Kasper Daniel Hansen6.4k wrote:
Well, clearly I have not done it, but I would expect that a decent implementation of my method would take less than 2 minutes (although it depends on length of the stuff in the BED file you started with). At least the computational load should not be much more than running findOverlaps. Kasper On Tue, Feb 1, 2011 at 10:06 AM, Duke <duke.lists at="" gmx.com=""> wrote: > On 1/31/11 1:20 PM, Kasper Daniel Hansen wrote: >> >> Use findOverlaps to find all cases. ?This is usually the hard and big >> computation. ?Then use for example pintersect() to compute the actual >> overlap in percent. ?There might be some tedious coding involved. > > Thanks for your suggestion Kasper, though honestly I have not tried it yet. > But based on what Martin and you suggested, I thought the final code will > not run fast because of extracting to strand/subset and running each. > Especially my task is a little more complicated: I need to find gene > expressions (counting sequences in exonic regions of each gene). I also gave > BEDTools a try, but it does not fulfil my needs (extremely slow for a gene > list of 28k). > > I ended up with coding a c++ code to do the job. Thanks for all of your > suggestions and helps guys. > > D. >
ADD COMMENTlink written 8.8 years ago by Kasper Daniel Hansen6.4k
On 2/1/11 12:07 PM, Kasper Daniel Hansen wrote: > Well, clearly I have not done it, but I would expect that a decent > implementation of my method would take less than 2 minutes (although > it depends on length of the stuff in the BED file you started with). > At least the computational load should not be much more than running > findOverlaps. I definitely want to solve my problem using R, but given that I am still new to R and that I have anlysis to be done, and that I need something that get the job done quick (that was why I decided to go for R with the hope that some bioconductor packages would help), I got it done with C++ first. As soon as I have a more time to spend, I will try to make it to work with R. D. > Kasper > > On Tue, Feb 1, 2011 at 10:06 AM, Duke<duke.lists at="" gmx.com=""> wrote: >> On 1/31/11 1:20 PM, Kasper Daniel Hansen wrote: >>> Use findOverlaps to find all cases. This is usually the hard and big >>> computation. Then use for example pintersect() to compute the actual >>> overlap in percent. There might be some tedious coding involved. >> Thanks for your suggestion Kasper, though honestly I have not tried it yet. >> But based on what Martin and you suggested, I thought the final code will >> not run fast because of extracting to strand/subset and running each. >> Especially my task is a little more complicated: I need to find gene >> expressions (counting sequences in exonic regions of each gene). I also gave >> BEDTools a try, but it does not fulfil my needs (extremely slow for a gene >> list of 28k). >> >> I ended up with coding a c++ code to do the job. Thanks for all of your >> suggestions and helps guys. >> >> D. >>
ADD REPLYlink written 8.8 years ago by Duke210
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 197 users visited in the last hour