Using precede()/follow() to find two ranges

0

Entering edit mode

d r ▴ 150

@d-r-5459

Last seen 6.2 years ago

Israel

Hello I am looking for a way to find the two preceiding/folliwng ranges in one GRanges object to each range in a second GRanges object. In other words, I am looking for a variation on precede() or follow() that will return for each range in x two ranges in subject instead of one: the nearest preceidng(following) range and the second nearest preceding (following) range. Is there a way to do this? (sorry for not giving any more constructive suggestions, I know relativily little on how GRanges works any therefore am at a loss as to how to proceed) Thanks in advacne Dolev Rahat [[alternative HTML version deleted]]

• 1.7k views

ADD COMMENT • link updated 10.7 years ago by Steve Lianoglou ★ 13k • written 10.7 years ago by d r ▴ 150

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi, On Wed, Aug 21, 2013 at 7:25 AM, d r <dolevrahat at="" gmail.com=""> wrote: > Hello > > I am looking for a way to find the two preceiding/folliwng ranges in one > GRanges object to each range in a second GRanges object. > > In other words, I am looking for a variation on precede() or follow() that > will return for each range in x two ranges in subject instead of one: the > nearest preceidng(following) range and the second nearest preceding > (following) range. > > Is there a way to do this? (sorry for not giving any more constructive > suggestions, I know relativily little on how GRanges works any therefore > am at a loss as to how to proceed) A first draft way of doing that would be to simply use the results from the first precede to remove those elements from the second call? For instance, if you have two ranges you are querying, gr1 and gr2: To get the immediately preceding ranges: R> p1 <- precede(gr1, gr2) Then to get the ones immediately after that: R> gr2.less <- gr2[-p1] R> p2 <- precede(gr1, gr2.less) Then you can see who is who with `gr2[p1]` and the who-who is who with `gr2.less[p2]` ... that should get you pretty close -- will likely have to handle edge cases, for instance when there are no preceding ranges, I think you get an NA (if I recall) so think about what you want to do with those. -steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech

ADD COMMENT • link 10.7 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Steve and all Thanks for your suggestion. I was in fact thinking in this direction. However, by completly discarding the first set of hits before querying the second time I run the risk of missing ranges that may be potential second hits beacuse they were discarded after being a first hit for other ranges. For examle, if I have these two GRanges: gr1: GRanges with 2 ranges and 1 metadata column: seqnames ranges strand | names <rle> <iranges> <rle> | <factor> [1] chr6 [125284212, 125284212] * | cg00991794 [2] chr6 [150465049, 150465049] * | cg02250071 gr2: GRanges with 4 ranges and 1 metadata column: seqnames ranges strand | gene <rle> <iranges> <rle> | <factor> [1] chr6 [126284212, 124284212] + | PGM3 [2] chr6 [160920998, 150920998] + | PLEKHG1 [3] chr6 [ 83903012, 83903012] + | PGM3 [4] chr6 [190102159, 170102159] + | WDR27 The first hits will be gr2[1] and gr2[2], which will both be discarded. Now if I call precede() again the hits I will get will be gr2[3] and gr[4], instead of getting again gr2[1] for gr1[2] and gr2[2] for gr1[1]. I was thinking that applying a function that will do what Steve suggested might do the trick if it can run on one range of gr1 at a time without modifying gr2 something along the lines of: two_hits_apply<-function(gr1,gr2) { p1<-precede(gr1,gr2) gr2.less<-gr2[-p1] p2<-precede(gr1,gr2.less) hits<-c(p1,p2) hits } now all I need is a way to apply a functuon over two GRanegs object. If I get it right, I will need somehting like mapply(), but that can actuaaly work on GRanges. Is such a function exists, or alternativly, is there a way to do this with the convential apply functions that I miss? Many thanks in advance Dolev On Wed, Aug 21, 2013 at 7:10 PM, Steve Lianoglou <lianoglou.steve@gene.com>wrote: > Hi, > > On Wed, Aug 21, 2013 at 7:25 AM, d r <dolevrahat@gmail.com> wrote: > > Hello > > > > I am looking for a way to find the two preceiding/folliwng ranges in one > > GRanges object to each range in a second GRanges object. > > > > In other words, I am looking for a variation on precede() or follow() > that > > will return for each range in x two ranges in subject instead of one: the > > nearest preceidng(following) range and the second nearest preceding > > (following) range. > > > > Is there a way to do this? (sorry for not giving any more constructive > > suggestions, I know relativily little on how GRanges works any therefore > > am at a loss as to how to proceed) > > A first draft way of doing that would be to simply use the results > from the first precede to remove those elements from the second call? > > For instance, if you have two ranges you are querying, gr1 and gr2: > > To get the immediately preceding ranges: > > R> p1 <- precede(gr1, gr2) > > Then to get the ones immediately after that: > > R> gr2.less <- gr2[-p1] > R> p2 <- precede(gr1, gr2.less) > > Then you can see who is who with `gr2[p1]` and the who-who is who with > `gr2.less[p2]` ... that should get you pretty close -- will likely > have to handle edge cases, for instance when there are no preceding > ranges, I think you get an NA (if I recall) so think about what you > want to do with those. > > -steve > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > [[alternative HTML version deleted]]

ADD REPLY • link 10.7 years ago d r ▴ 150

0

Entering edit mode

Dolev, Before chiming in on the problem as it is currently framed, I want to make sure of something. The documentation for precede and follow read 'Overlapping ranges are excluded.'. For instance if one of your cg ranges were to overlap one of your gene ranges, any method based on precede/follow will not discover this association. Is this really desirable for your application? -Malcolm >-----Original Message----- >From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of d r >Sent: Wednesday, August 21, 2013 12:15 PM >To: Steve Lianoglou >Cc: bioconductor at r-project.org list >Subject: Re: [BioC] Using precede()/follow() to find two ranges > >Hi Steve and all > >Thanks for your suggestion. I was in fact thinking in this direction. >However, by completly discarding the first set of hits before querying the >second time I run the risk of missing ranges that may be potential second >hits beacuse they were discarded after being a first hit for other ranges. >For examle, if I have these two GRanges: > >gr1: > >GRanges with 2 ranges and 1 metadata column: > > seqnames ranges strand | names > > <rle> <iranges> <rle> | <factor> > > [1] chr6 [125284212, 125284212] * | cg00991794 > > [2] chr6 [150465049, 150465049] * | cg02250071 > > > > > >gr2: > >GRanges with 4 ranges and 1 metadata column: > > seqnames ranges strand | gene > > <rle> <iranges> <rle> | <factor> > > [1] chr6 [126284212, 124284212] + | PGM3 > > [2] chr6 [160920998, 150920998] + | PLEKHG1 > > [3] chr6 [ 83903012, 83903012] + | PGM3 > > [4] chr6 [190102159, 170102159] + | WDR27 > >The first hits will be gr2[1] and gr2[2], which will both be discarded. > >Now if I call precede() again the hits I will get will be gr2[3] and >gr[4], instead of getting again gr2[1] for gr1[2] and gr2[2] for >gr1[1]. > > >I was thinking that applying a function that will do what Steve >suggested might do the trick if it can run on one range of gr1 at a >time without modifying gr2 > >something along the lines of: > >two_hits_apply<-function(gr1,gr2) >{ > >p1<-precede(gr1,gr2) > >gr2.less<-gr2[-p1] > >p2<-precede(gr1,gr2.less) > >hits<-c(p1,p2) > >hits > >} > >now all I need is a way to apply a functuon over two GRanegs object. >If I get it right, I will need somehting like mapply(), but that can >actuaaly work on GRanges. > >Is such a function exists, or alternativly, is there a way to do this >with the convential apply functions that I miss? > >Many thanks in advance >Dolev > > > > > > > >On Wed, Aug 21, 2013 at 7:10 PM, Steve Lianoglou ><lianoglou.steve at="" gene.com="">wrote: > >> Hi, >> >> On Wed, Aug 21, 2013 at 7:25 AM, d r <dolevrahat at="" gmail.com=""> wrote: >> > Hello >> > >> > I am looking for a way to find the two preceiding/folliwng ranges in one >> > GRanges object to each range in a second GRanges object. >> > >> > In other words, I am looking for a variation on precede() or follow() >> that >> > will return for each range in x two ranges in subject instead of one: the >> > nearest preceidng(following) range and the second nearest preceding >> > (following) range. >> > >> > Is there a way to do this? (sorry for not giving any more constructive >> > suggestions, I know relativily little on how GRanges works any therefore >> > am at a loss as to how to proceed) >> >> A first draft way of doing that would be to simply use the results >> from the first precede to remove those elements from the second call? >> >> For instance, if you have two ranges you are querying, gr1 and gr2: >> >> To get the immediately preceding ranges: >> >> R> p1 <- precede(gr1, gr2) >> >> Then to get the ones immediately after that: >> >> R> gr2.less <- gr2[-p1] >> R> p2 <- precede(gr1, gr2.less) >> >> Then you can see who is who with `gr2[p1]` and the who-who is who with >> `gr2.less[p2]` ... that should get you pretty close -- will likely >> have to handle edge cases, for instance when there are no preceding >> ranges, I think you get an NA (if I recall) so think about what you >> want to do with those. >> >> -steve >> >> -- >> Steve Lianoglou >> Computational Biologist >> Bioinformatics and Computational Biology >> Genentech >> > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor at r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 10.7 years ago Malcolm Cook ★ 1.6k

0

Entering edit mode

Hi Malcolm Thanks for your thoughtfulness. For the specific problem I'm working on I intend to consider only the TSSs of the genes, which I will compare against illumina 450k probes, which are also single nucleotide ranges. Therefore, I do not think that overlapping ranges should be an issue here. Of course, it is theoretically possible that a probe may overlap a TSS, but since I'm looking for regulatory elements I am not interested in this type of associations anyhow. So, any insight you may be able to offer on this problem will be highly appreciated. However, I think that for my particular problem overlapping ranges should not be an issue. On Wed, Aug 21, 2013 at 8:59 PM, Cook, Malcolm <mec@stowers.org> wrote: > Dolev, > > Before chiming in on the problem as it is currently framed, I want to make > sure of something. > > The documentation for precede and follow read 'Overlapping ranges are > excluded.'. > > For instance if one of your cg ranges were to overlap one of your gene > ranges, any method based on precede/follow will not discover this > association. > > Is this really desirable for your application? > > -Malcolm > > > >-----Original Message----- > >From: bioconductor-bounces@r-project.org [mailto: > bioconductor-bounces@r-project.org] On Behalf Of d r > >Sent: Wednesday, August 21, 2013 12:15 PM > >To: Steve Lianoglou > >Cc: bioconductor@r-project.org list > >Subject: Re: [BioC] Using precede()/follow() to find two ranges > > > >Hi Steve and all > > > >Thanks for your suggestion. I was in fact thinking in this direction. > >However, by completly discarding the first set of hits before querying > the > >second time I run the risk of missing ranges that may be potential second > >hits beacuse they were discarded after being a first hit for other > ranges. > >For examle, if I have these two GRanges: > > > >gr1: > > > >GRanges with 2 ranges and 1 metadata column: > > > > seqnames ranges strand | names > > > > <rle> <iranges> <rle> | <factor> > > > > [1] chr6 [125284212, 125284212] * | cg00991794 > > > > [2] chr6 [150465049, 150465049] * | cg02250071 > > > > > > > > > > > >gr2: > > > >GRanges with 4 ranges and 1 metadata column: > > > > seqnames ranges strand | gene > > > > <rle> <iranges> <rle> | <factor> > > > > [1] chr6 [126284212, 124284212] + | PGM3 > > > > [2] chr6 [160920998, 150920998] + | PLEKHG1 > > > > [3] chr6 [ 83903012, 83903012] + | PGM3 > > > > [4] chr6 [190102159, 170102159] + | WDR27 > > > >The first hits will be gr2[1] and gr2[2], which will both be discarded. > > > >Now if I call precede() again the hits I will get will be gr2[3] and > >gr[4], instead of getting again gr2[1] for gr1[2] and gr2[2] for > >gr1[1]. > > > > > >I was thinking that applying a function that will do what Steve > >suggested might do the trick if it can run on one range of gr1 at a > >time without modifying gr2 > > > >something along the lines of: > > > >two_hits_apply<-function(gr1,gr2) > >{ > > > >p1<-precede(gr1,gr2) > > > >gr2.less<-gr2[-p1] > > > >p2<-precede(gr1,gr2.less) > > > >hits<-c(p1,p2) > > > >hits > > > >} > > > >now all I need is a way to apply a functuon over two GRanegs object. > >If I get it right, I will need somehting like mapply(), but that can > >actuaaly work on GRanges. > > > >Is such a function exists, or alternativly, is there a way to do this > >with the convential apply functions that I miss? > > > >Many thanks in advance > >Dolev > > > > > > > > > > > > > > > >On Wed, Aug 21, 2013 at 7:10 PM, Steve Lianoglou > ><lianoglou.steve@gene.com>wrote: > > > >> Hi, > >> > >> On Wed, Aug 21, 2013 at 7:25 AM, d r <dolevrahat@gmail.com> wrote: > >> > Hello > >> > > >> > I am looking for a way to find the two preceiding/folliwng ranges in > one > >> > GRanges object to each range in a second GRanges object. > >> > > >> > In other words, I am looking for a variation on precede() or follow() > >> that > >> > will return for each range in x two ranges in subject instead of > one: the > >> > nearest preceidng(following) range and the second nearest preceding > >> > (following) range. > >> > > >> > Is there a way to do this? (sorry for not giving any more > constructive > >> > suggestions, I know relativily little on how GRanges works any > therefore > >> > am at a loss as to how to proceed) > >> > >> A first draft way of doing that would be to simply use the results > >> from the first precede to remove those elements from the second call? > >> > >> For instance, if you have two ranges you are querying, gr1 and gr2: > >> > >> To get the immediately preceding ranges: > >> > >> R> p1 <- precede(gr1, gr2) > >> > >> Then to get the ones immediately after that: > >> > >> R> gr2.less <- gr2[-p1] > >> R> p2 <- precede(gr1, gr2.less) > >> > >> Then you can see who is who with `gr2[p1]` and the who-who is who with > >> `gr2.less[p2]` ... that should get you pretty close -- will likely > >> have to handle edge cases, for instance when there are no preceding > >> ranges, I think you get an NA (if I recall) so think about what you > >> want to do with those. > >> > >> -steve > >> > >> -- > >> Steve Lianoglou > >> Computational Biologist > >> Bioinformatics and Computational Biology > >> Genentech > >> > > > > [[alternative HTML version deleted]] > > > >_______________________________________________ > >Bioconductor mailing list > >Bioconductor@r-project.org > >https://stat.ethz.ch/mailman/listinfo/bioconductor > >Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 10.7 years ago d r ▴ 150

0

Entering edit mode

Well, there are many ways to skin this kind of cat (apologies to the internet). Indeed, there are many cats that are slight variations on this cat. re: "since I'm looking for regulatory elements I am not interested in this type of associations anyhow" I think that you should probably be handling this else-how. If what you mean is that there should be some minimal distance between your cgs and your genes, then, you should probably explicitly make this a requirement, and not add it as an afterthought when faced with the possibilities. That said, if you are wed to the exact to the exact results you specified, and you want explicitly to exclude overlaps with your cgs, I think you will find that you will want to compose something along these lines: 1) use precede and follow each twice to generate your two candidate neighbors 2) use the range or these candidates and search using findOverlaps within them to see if there are others that were skipped Cheers, Malcolm From: d r [mailto:dolevrahat@gmail.com] Sent: Wednesday, August 21, 2013 2:57 PM To: Cook, Malcolm Cc: Steve Lianoglou; bioconductor@r-project.org list Subject: Re: [BioC] Using precede()/follow() to find two ranges Hi Malcolm Thanks for your thoughtfulness. For the specific problem I'm working on I intend to consider only the TSSs of the genes, which I will compare against illumina 450k probes, which are also single nucleotide ranges. Therefore, I do not think that overlapping ranges should be an issue here. Of course, it is theoretically possible that a probe may overlap a TSS, but since I'm looking for regulatory elements I am not interested in this type of associations anyhow. So, any insight you may be able to offer on this problem will be highly appreciated. However, I think that for my particular problem overlapping ranges should not be an issue. On Wed, Aug 21, 2013 at 8:59 PM, Cook, Malcolm <mec@stowers.org<mailto:mec@stowers.org>> wrote: Dolev, Before chiming in on the problem as it is currently framed, I want to make sure of something. The documentation for precede and follow read 'Overlapping ranges are excluded.'. For instance if one of your cg ranges were to overlap one of your gene ranges, any method based on precede/follow will not discover this association. Is this really desirable for your application? -Malcolm >-----Original Message----- >From: bioconductor-bounces@r-project.org<mailto:bioconductor- bounces@r-project.org=""> [mailto:bioconductor- bounces@r-project.org<mailto:bioconductor-bounces@r-project.org>] On Behalf Of d r >Sent: Wednesday, August 21, 2013 12:15 PM >To: Steve Lianoglou >Cc: bioconductor@r-project.org<mailto:bioconductor@r-project.org> list >Subject: Re: [BioC] Using precede()/follow() to find two ranges > >Hi Steve and all > >Thanks for your suggestion. I was in fact thinking in this direction. >However, by completly discarding the first set of hits before querying the >second time I run the risk of missing ranges that may be potential second >hits beacuse they were discarded after being a first hit for other ranges. >For examle, if I have these two GRanges: > >gr1: > >GRanges with 2 ranges and 1 metadata column: > > seqnames ranges strand | names > > <rle> <iranges> <rle> | <factor> > > [1] chr6 [125284212, 125284212] * | cg00991794 > > [2] chr6 [150465049, 150465049] * | cg02250071 > > > > > >gr2: > >GRanges with 4 ranges and 1 metadata column: > > seqnames ranges strand | gene > > <rle> <iranges> <rle> | <factor> > > [1] chr6 [126284212, 124284212] + | PGM3 > > [2] chr6 [160920998, 150920998] + | PLEKHG1 > > [3] chr6 [ 83903012, 83903012] + | PGM3 > > [4] chr6 [190102159, 170102159] + | WDR27 > >The first hits will be gr2[1] and gr2[2], which will both be discarded. > >Now if I call precede() again the hits I will get will be gr2[3] and >gr[4], instead of getting again gr2[1] for gr1[2] and gr2[2] for >gr1[1]. > > >I was thinking that applying a function that will do what Steve >suggested might do the trick if it can run on one range of gr1 at a >time without modifying gr2 > >something along the lines of: > >two_hits_apply<-function(gr1,gr2) >{ > >p1<-precede(gr1,gr2) > >gr2.less<-gr2[-p1] > >p2<-precede(gr1,gr2.less) > >hits<-c(p1,p2) > >hits > >} > >now all I need is a way to apply a functuon over two GRanegs object. >If I get it right, I will need somehting like mapply(), but that can >actuaaly work on GRanges. > >Is such a function exists, or alternativly, is there a way to do this >with the convential apply functions that I miss? > >Many thanks in advance >Dolev > > > > > > > >On Wed, Aug 21, 2013 at 7:10 PM, Steve Lianoglou ><lianoglou.steve@gene.com<mailto:lianoglou.steve@gene.com>>wrote: > >> Hi, >> >> On Wed, Aug 21, 2013 at 7:25 AM, d r <dolevrahat@gmail.com<mailto:dolevrahat@gmail.com>> wrote: >> > Hello >> > >> > I am looking for a way to find the two preceiding/folliwng ranges in one >> > GRanges object to each range in a second GRanges object. >> > >> > In other words, I am looking for a variation on precede() or follow() >> that >> > will return for each range in x two ranges in subject instead of one: the >> > nearest preceidng(following) range and the second nearest preceding >> > (following) range. >> > >> > Is there a way to do this? (sorry for not giving any more constructive >> > suggestions, I know relativily little on how GRanges works any therefore >> > am at a loss as to how to proceed) >> >> A first draft way of doing that would be to simply use the results >> from the first precede to remove those elements from the second call? >> >> For instance, if you have two ranges you are querying, gr1 and gr2: >> >> To get the immediately preceding ranges: >> >> R> p1 <- precede(gr1, gr2) >> >> Then to get the ones immediately after that: >> >> R> gr2.less <- gr2[-p1] >> R> p2 <- precede(gr1, gr2.less) >> >> Then you can see who is who with `gr2[p1]` and the who-who is who with >> `gr2.less[p2]` ... that should get you pretty close -- will likely >> have to handle edge cases, for instance when there are no preceding >> ranges, I think you get an NA (if I recall) so think about what you >> want to do with those. >> >> -steve >> >> -- >> Steve Lianoglou >> Computational Biologist >> Bioinformatics and Computational Biology >> Genentech >> > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor@r-project.org<mailto:bioconductor@r-project.org> >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 10.7 years ago Malcolm Cook ★ 1.6k

0

Entering edit mode

On 08/21/2013 01:46 PM, Cook, Malcolm wrote: > Well, there are many ways to skin this kind of cat (apologies to the internet). > > Indeed, there are many cats that are slight variations on this cat. > > re: "since I'm looking for regulatory elements I am not interested in this type of associations anyhow" > > I think that you should probably be handling this else-how. If what you mean is that there should be some minimal distance between your cgs and your genes, then, you should probably explicitly make this a requirement, and not add it as an afterthought when faced with the possibilities. > > That said, if you are wed to the exact to the exact results you specified, and you want explicitly to exclude overlaps with your cgs, I think you will find that you will want to compose something along these lines: > > > 1) use precede and follow each twice to generate your two candidate neighbors Not sure if this is what Malcolm meant, or if I'm being a little simplistic, but maybe with subj = GRanges("A", IRanges(seq(10, 50, 10), width=1), "+") query = GRanges("A", IRanges(seq(31, 51, 10), width=1), "+") The first range that query follows is f1 = follow(query, subj) and the second is f2 = follow(subj[f1], subj) giving the indexes into subj as > f1 [1] 3 4 5 > f2 [1] 2 3 4 I guess there are problems when there is no following range (so need to check for NA's before doing the second follow? Another strategy might simply sort(subj) and then f2 = f1 - 1 (except when seqnames(subj)[f1] != seqnames(subj)[f2]). Probably strand will confusing things! Martin > > 2) use the range or these candidates and search using findOverlaps within them to see if there are others that were skipped > > Cheers, > > Malcolm > > From: d r [mailto:dolevrahat at gmail.com] > Sent: Wednesday, August 21, 2013 2:57 PM > To: Cook, Malcolm > Cc: Steve Lianoglou; bioconductor at r-project.org list > Subject: Re: [BioC] Using precede()/follow() to find two ranges > > Hi Malcolm > Thanks for your thoughtfulness. > For the specific problem I'm working on I intend to consider only the TSSs of the genes, which I will compare against illumina 450k probes, which are also single nucleotide ranges. > Therefore, I do not think that overlapping ranges should be an issue here. Of course, it is theoretically possible that a probe may overlap a TSS, but since I'm looking for regulatory elements I am not interested in this type of associations anyhow. > So, any insight you may be able to offer on this problem will be highly appreciated. > > However, I think that for my particular problem overlapping ranges should not be an issue. > > On Wed, Aug 21, 2013 at 8:59 PM, Cook, Malcolm <mec at="" stowers.org<mailto:mec="" at="" stowers.org="">> wrote: > Dolev, > > Before chiming in on the problem as it is currently framed, I want to make sure of something. > > The documentation for precede and follow read 'Overlapping ranges are excluded.'. > > For instance if one of your cg ranges were to overlap one of your gene ranges, any method based on precede/follow will not discover this association. > > Is this really desirable for your application? > > -Malcolm > > > >-----Original Message----- > >From: bioconductor-bounces at r-project.org<mailto:bioconductor- bounces="" at="" r-project.org=""> [mailto:bioconductor-bounces at r-project.org<mailto:bioconductor-bounces at="" r-project.org="">] On Behalf Of d r > >Sent: Wednesday, August 21, 2013 12:15 PM > >To: Steve Lianoglou > >Cc: bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> list > >Subject: Re: [BioC] Using precede()/follow() to find two ranges > > > >Hi Steve and all > > > >Thanks for your suggestion. I was in fact thinking in this direction. > >However, by completly discarding the first set of hits before querying the > >second time I run the risk of missing ranges that may be potential second > >hits beacuse they were discarded after being a first hit for other ranges. > >For examle, if I have these two GRanges: > > > >gr1: > > > >GRanges with 2 ranges and 1 metadata column: > > > > seqnames ranges strand | names > > > > <rle> <iranges> <rle> | <factor> > > > > [1] chr6 [125284212, 125284212] * | cg00991794 > > > > [2] chr6 [150465049, 150465049] * | cg02250071 > > > > > > > > > > > >gr2: > > > >GRanges with 4 ranges and 1 metadata column: > > > > seqnames ranges strand | gene > > > > <rle> <iranges> <rle> | <factor> > > > > [1] chr6 [126284212, 124284212] + | PGM3 > > > > [2] chr6 [160920998, 150920998] + | PLEKHG1 > > > > [3] chr6 [ 83903012, 83903012] + | PGM3 > > > > [4] chr6 [190102159, 170102159] + | WDR27 > > > >The first hits will be gr2[1] and gr2[2], which will both be discarded. > > > >Now if I call precede() again the hits I will get will be gr2[3] and > >gr[4], instead of getting again gr2[1] for gr1[2] and gr2[2] for > >gr1[1]. > > > > > >I was thinking that applying a function that will do what Steve > >suggested might do the trick if it can run on one range of gr1 at a > >time without modifying gr2 > > > >something along the lines of: > > > >two_hits_apply<-function(gr1,gr2) > >{ > > > >p1<-precede(gr1,gr2) > > > >gr2.less<-gr2[-p1] > > > >p2<-precede(gr1,gr2.less) > > > >hits<-c(p1,p2) > > > >hits > > > >} > > > >now all I need is a way to apply a functuon over two GRanegs object. > >If I get it right, I will need somehting like mapply(), but that can > >actuaaly work on GRanges. > > > >Is such a function exists, or alternativly, is there a way to do this > >with the convential apply functions that I miss? > > > >Many thanks in advance > >Dolev > > > > > > > > > > > > > > > >On Wed, Aug 21, 2013 at 7:10 PM, Steve Lianoglou > ><lianoglou.steve at="" gene.com<mailto:lianoglou.steve="" at="" gene.com="">>wrote: > > > >> Hi, > >> > >> On Wed, Aug 21, 2013 at 7:25 AM, d r <dolevrahat at="" gmail.com<mailto:dolevrahat="" at="" gmail.com="">> wrote: > >> > Hello > >> > > >> > I am looking for a way to find the two preceiding/folliwng ranges in one > >> > GRanges object to each range in a second GRanges object. > >> > > >> > In other words, I am looking for a variation on precede() or follow() > >> that > >> > will return for each range in x two ranges in subject instead of one: the > >> > nearest preceidng(following) range and the second nearest preceding > >> > (following) range. > >> > > >> > Is there a way to do this? (sorry for not giving any more constructive > >> > suggestions, I know relativily little on how GRanges works any therefore > >> > am at a loss as to how to proceed) > >> > >> A first draft way of doing that would be to simply use the results > >> from the first precede to remove those elements from the second call? > >> > >> For instance, if you have two ranges you are querying, gr1 and gr2: > >> > >> To get the immediately preceding ranges: > >> > >> R> p1 <- precede(gr1, gr2) > >> > >> Then to get the ones immediately after that: > >> > >> R> gr2.less <- gr2[-p1] > >> R> p2 <- precede(gr1, gr2.less) > >> > >> Then you can see who is who with `gr2[p1]` and the who-who is who with > >> `gr2.less[p2]` ... that should get you pretty close -- will likely > >> have to handle edge cases, for instance when there are no preceding > >> ranges, I think you get an NA (if I recall) so think about what you > >> want to do with those. > >> > >> -steve > >> > >> -- > >> Steve Lianoglou > >> Computational Biologist > >> Bioinformatics and Computational Biology > >> Genentech > >> > > > > [[alternative HTML version deleted]] > > > >_______________________________________________ > >Bioconductor mailing list > >Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > >https://stat.ethz.ch/mailman/listinfo/bioconductor > >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 10.7 years ago Martin Morgan 25k

0

Entering edit mode

Thanks Martin, this sound like a useful approach. I will try to implement it. Malcolm - I am not sure that I understood you're comments. I do not "want explicitly to exclude overlaps with my cgs". Rather, given that my particular problem invloves intersecting single nuclotide ranges, I do not think that the problem of overlaps is likely to arise. Am I missing something here? Thanks agian for all of your for being so helpful Dolev On Thu, Aug 22, 2013 at 7:57 AM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > On 08/21/2013 01:46 PM, Cook, Malcolm wrote: > >> Well, there are many ways to skin this kind of cat (apologies to the >> internet). >> >> Indeed, there are many cats that are slight variations on this cat. >> >> re: "since I'm looking for regulatory elements I am not interested in >> this type of associations anyhow" >> >> I think that you should probably be handling this else-how. If what you >> mean is that there should be some minimal distance between your cgs and >> your genes, then, you should probably explicitly make this a requirement, >> and not add it as an afterthought when faced with the possibilities. >> >> That said, if you are wed to the exact to the exact results you >> specified, and you want explicitly to exclude overlaps with your cgs, I >> think you will find that you will want to compose something along these >> lines: >> >> >> 1) use precede and follow each twice to generate your two candidate >> neighbors >> > > Not sure if this is what Malcolm meant, or if I'm being a little > simplistic, but maybe with > > subj = GRanges("A", IRanges(seq(10, 50, 10), width=1), "+") > query = GRanges("A", IRanges(seq(31, 51, 10), width=1), "+") > > The first range that query follows is > > f1 = follow(query, subj) > > and the second is > > f2 = follow(subj[f1], subj) > > giving the indexes into subj as > > > f1 > [1] 3 4 5 > > f2 > [1] 2 3 4 > > I guess there are problems when there is no following range (so need to > check for NA's before doing the second follow? > > Another strategy might simply > > sort(subj) > > and then f2 = f1 - 1 (except when seqnames(subj)[f1] != > seqnames(subj)[f2]). Probably strand will confusing things! > > Martin > > >> 2) use the range or these candidates and search using findOverlaps >> within them to see if there are others that were skipped >> >> Cheers, >> >> Malcolm >> >> From: d r [mailto:dolevrahat@gmail.com] >> Sent: Wednesday, August 21, 2013 2:57 PM >> To: Cook, Malcolm >> Cc: Steve Lianoglou; bioconductor@r-project.org list >> Subject: Re: [BioC] Using precede()/follow() to find two ranges >> >> Hi Malcolm >> Thanks for your thoughtfulness. >> For the specific problem I'm working on I intend to consider only the >> TSSs of the genes, which I will compare against illumina 450k probes, which >> are also single nucleotide ranges. >> Therefore, I do not think that overlapping ranges should be an issue >> here. Of course, it is theoretically possible that a probe may overlap a >> TSS, but since I'm looking for regulatory elements I am not interested in >> this type of associations anyhow. >> So, any insight you may be able to offer on this problem will be highly >> appreciated. >> >> However, I think that for my particular problem overlapping ranges should >> not be an issue. >> >> On Wed, Aug 21, 2013 at 8:59 PM, Cook, Malcolm <mec@stowers.org<mailto:>> MEC@**stowers.org <mec@stowers.org>>> wrote: >> Dolev, >> >> Before chiming in on the problem as it is currently framed, I want to >> make sure of something. >> >> The documentation for precede and follow read 'Overlapping ranges are >> excluded.'. >> >> For instance if one of your cg ranges were to overlap one of your gene >> ranges, any method based on precede/follow will not discover this >> association. >> >> Is this really desirable for your application? >> >> -Malcolm >> >> >> >-----Original Message----- >> >From: bioconductor-bounces@r-**project.org<bioconductor- bounces@r-project.org=""> >> <mailto:bioconducto**r-bounces@r-project.org<bioconductor- bounces@r-project.org="">> >> [mailto:bioconductor-bounces@**r-project.org<bioconductor- bounces@r-project.org=""> >> <mailto:bioconduc**tor-bounces@r-project.org<bioconductor- bounces@r-project.org="">>] >> On Behalf Of d r >> >Sent: Wednesday, August 21, 2013 12:15 PM >> >To: Steve Lianoglou >> >Cc: bioconductor@r-project.org<**mailto:bioconductor@r-project.* *org<bioconductor@r-project.org>> >> list >> >Subject: Re: [BioC] Using precede()/follow() to find two ranges >> > >> >Hi Steve and all >> > >> >Thanks for your suggestion. I was in fact thinking in this direction. >> >However, by completly discarding the first set of hits before querying >> the >> >second time I run the risk of missing ranges that may be potential >> second >> >hits beacuse they were discarded after being a first hit for other >> ranges. >> >For examle, if I have these two GRanges: >> > >> >gr1: >> > >> >GRanges with 2 ranges and 1 metadata column: >> > >> > seqnames ranges strand | names >> > >> > <rle> <iranges> <rle> | <factor> >> > >> > [1] chr6 [125284212, 125284212] * | cg00991794 >> > >> > [2] chr6 [150465049, 150465049] * | cg02250071 >> > >> > >> > >> > >> > >> >gr2: >> > >> >GRanges with 4 ranges and 1 metadata column: >> > >> > seqnames ranges strand | gene >> > >> > <rle> <iranges> <rle> | <factor> >> > >> > [1] chr6 [126284212, 124284212] + | PGM3 >> > >> > [2] chr6 [160920998, 150920998] + | PLEKHG1 >> > >> > [3] chr6 [ 83903012, 83903012] + | PGM3 >> > >> > [4] chr6 [190102159, 170102159] + | WDR27 >> > >> >The first hits will be gr2[1] and gr2[2], which will both be discarded. >> > >> >Now if I call precede() again the hits I will get will be gr2[3] and >> >gr[4], instead of getting again gr2[1] for gr1[2] and gr2[2] for >> >gr1[1]. >> > >> > >> >I was thinking that applying a function that will do what Steve >> >suggested might do the trick if it can run on one range of gr1 at a >> >time without modifying gr2 >> > >> >something along the lines of: >> > >> >two_hits_apply<-function(gr1,**gr2) >> >{ >> > >> >p1<-precede(gr1,gr2) >> > >> >gr2.less<-gr2[-p1] >> > >> >p2<-precede(gr1,gr2.less) >> > >> >hits<-c(p1,p2) >> > >> >hits >> > >> >} >> > >> >now all I need is a way to apply a functuon over two GRanegs object. >> >If I get it right, I will need somehting like mapply(), but that can >> >actuaaly work on GRanges. >> > >> >Is such a function exists, or alternativly, is there a way to do this >> >with the convential apply functions that I miss? >> > >> >Many thanks in advance >> >Dolev >> > >> > >> > >> > >> > >> > >> > >> >On Wed, Aug 21, 2013 at 7:10 PM, Steve Lianoglou >> ><lianoglou.steve@gene.com<**mailto:lianoglou.steve@gene.**com<li anoglou.steve@gene.com=""> >> >>wrote: >> > >> >> Hi, >> >> >> >> >> On Wed, Aug 21, 2013 at 7:25 AM, d r <dolevrahat@gmail.com<mailto:d*>> *olevrahat@gmail.com <dolevrahat@gmail.com>>> wrote: >> >> > Hello >> >> > >> >> > I am looking for a way to find the two preceiding/folliwng ranges >> in one >> >> > GRanges object to each range in a second GRanges object. >> >> > >> >> > In other words, I am looking for a variation on precede() or >> follow() >> >> that >> >> > will return for each range in x two ranges in subject instead of >> one: the >> >> > nearest preceidng(following) range and the second nearest preceding >> >> > (following) range. >> >> > >> >> > Is there a way to do this? (sorry for not giving any more >> constructive >> >> > suggestions, I know relativily little on how GRanges works any >> therefore >> >> > am at a loss as to how to proceed) >> >> >> >> A first draft way of doing that would be to simply use the results >> >> from the first precede to remove those elements from the second call? >> >> >> >> For instance, if you have two ranges you are querying, gr1 and gr2: >> >> >> >> To get the immediately preceding ranges: >> >> >> >> R> p1 <- precede(gr1, gr2) >> >> >> >> Then to get the ones immediately after that: >> >> >> >> R> gr2.less <- gr2[-p1] >> >> R> p2 <- precede(gr1, gr2.less) >> >> >> >> Then you can see who is who with `gr2[p1]` and the who-who is who >> with >> >> `gr2.less[p2]` ... that should get you pretty close -- will likely >> >> have to handle edge cases, for instance when there are no preceding >> >> ranges, I think you get an NA (if I recall) so think about what you >> >> want to do with those. >> >> >> >> -steve >> >> >> >> -- >> >> Steve Lianoglou >> >> Computational Biologist >> >> Bioinformatics and Computational Biology >> >> Genentech >> >> >> > >> > [[alternative HTML version deleted]] >> > >> >_____________________________**__________________ >> >Bioconductor mailing list >> >Bioconductor@r-project.org<**mailto:Bioconductor@r-project.**org <bioconductor@r-project.org> >> > >> >> >https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: sta="" t.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> >Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > [[alternative HTML version deleted]]

ADD REPLY • link 10.7 years ago d r ▴ 150

0

Entering edit mode

Dolev, If there are no overlaps within the ranges of your subjects then approaches involving precede and follow will of course not be effected by overlaps between ranges in your subjects, since, well, there aren't any. Still be careful with approaches suing precede/follow alone since any of your 1 nt wide query ranges which are themselves exactly present in your 1 nt-wide subject ranges won't be found using precede/follow. ~Malcolm From: d r [mailto:dolevrahat@gmail.com] Sent: Thursday, August 22, 2013 2:40 AM To: Martin Morgan Cc: Cook, Malcolm; Steve Lianoglou; bioconductor@r-project.org list Subject: Re: [BioC] Using precede()/follow() to find two ranges Thanks Martin, this sound like a useful approach. I will try to implement it. Malcolm - I am not sure that I understood you're comments. I do not "want explicitly to exclude overlaps with my cgs". Rather, given that my particular problem invloves intersecting single nuclotide ranges, I do not think that the problem of overlaps is likely to arise. Am I missing something here? Thanks agian for all of your for being so helpful Dolev On Thu, Aug 22, 2013 at 7:57 AM, Martin Morgan <mtmorgan@fhcrc.org<mailto:mtmorgan@fhcrc.org>> wrote: On 08/21/2013 01:46 PM, Cook, Malcolm wrote: Well, there are many ways to skin this kind of cat (apologies to the internet). Indeed, there are many cats that are slight variations on this cat. re: "since I'm looking for regulatory elements I am not interested in this type of associations anyhow" I think that you should probably be handling this else-how. If what you mean is that there should be some minimal distance between your cgs and your genes, then, you should probably explicitly make this a requirement, and not add it as an afterthought when faced with the possibilities. That said, if you are wed to the exact to the exact results you specified, and you want explicitly to exclude overlaps with your cgs, I think you will find that you will want to compose something along these lines: 1) use precede and follow each twice to generate your two candidate neighbors Not sure if this is what Malcolm meant, or if I'm being a little simplistic, but maybe with subj = GRanges("A", IRanges(seq(10, 50, 10), width=1), "+") query = GRanges("A", IRanges(seq(31, 51, 10), width=1), "+") The first range that query follows is f1 = follow(query, subj) and the second is f2 = follow(subj[f1], subj) giving the indexes into subj as > f1 [1] 3 4 5 > f2 [1] 2 3 4 I guess there are problems when there is no following range (so need to check for NA's before doing the second follow? Another strategy might simply sort(subj) and then f2 = f1 - 1 (except when seqnames(subj)[f1] != seqnames(subj)[f2]). Probably strand will confusing things! Martin 2) use the range or these candidates and search using findOverlaps within them to see if there are others that were skipped Cheers, Malcolm From: d r [mailto:dolevrahat@gmail.com<mailto:dolevrahat@gmail.com>] Sent: Wednesday, August 21, 2013 2:57 PM To: Cook, Malcolm Cc: Steve Lianoglou; bioconductor@r-project.org<mailto:bioconductor@r-project.org> list Subject: Re: [BioC] Using precede()/follow() to find two ranges Hi Malcolm Thanks for your thoughtfulness. For the specific problem I'm working on I intend to consider only the TSSs of the genes, which I will compare against illumina 450k probes, which are also single nucleotide ranges. Therefore, I do not think that overlapping ranges should be an issue here. Of course, it is theoretically possible that a probe may overlap a TSS, but since I'm looking for regulatory elements I am not interested in this type of associations anyhow. So, any insight you may be able to offer on this problem will be highly appreciated. However, I think that for my particular problem overlapping ranges should not be an issue. On Wed, Aug 21, 2013 at 8:59 PM, Cook, Malcolm <mec@stowers.org<mailto :mec@stowers.org=""><mailto:mec@stowers.org<mailto:mec@stowers.org>>> wrote: Dolev, Before chiming in on the problem as it is currently framed, I want to make sure of something. The documentation for precede and follow read 'Overlapping ranges are excluded.'. For instance if one of your cg ranges were to overlap one of your gene ranges, any method based on precede/follow will not discover this association. Is this really desirable for your application? -Malcolm >-----Original Message----- >From: bioconductor-bounces@r-project.org<mailto:bioconductor- bounces@r-project.org=""><mailto:bioconductor- bounces@r-project.org<mailto:bioconductor-bounces@r-project.org="">> [mailto:bioconductor-bounces@r-project.org<mailto:bioconductor- bounces@r-project.org=""><mailto:bioconductor- bounces@r-project.org<mailto:bioconductor-bounces@r-project.org="">>] On Behalf Of d r >Sent: Wednesday, August 21, 2013 12:15 PM >To: Steve Lianoglou >Cc: bioconductor@r-project.org<mailto:bioconductor@r-project.org>> list >Subject: Re: [BioC] Using precede()/follow() to find two ranges > >Hi Steve and all > >Thanks for your suggestion. I was in fact thinking in this direction. >However, by completly discarding the first set of hits before querying the >second time I run the risk of missing ranges that may be potential second >hits beacuse they were discarded after being a first hit for other ranges. >For examle, if I have these two GRanges: > >gr1: > >GRanges with 2 ranges and 1 metadata column: > > seqnames ranges strand | names > > <rle> <iranges> <rle> | <factor> > > [1] chr6 [125284212, 125284212] * | cg00991794 > > [2] chr6 [150465049, 150465049] * | cg02250071 > > > > > >gr2: > >GRanges with 4 ranges and 1 metadata column: > > seqnames ranges strand | gene > > <rle> <iranges> <rle> | <factor> > > [1] chr6 [126284212, 124284212] + | PGM3 > > [2] chr6 [160920998, 150920998] + | PLEKHG1 > > [3] chr6 [ 83903012, 83903012] + | PGM3 > > [4] chr6 [190102159, 170102159] + | WDR27 > >The first hits will be gr2[1] and gr2[2], which will both be discarded. > >Now if I call precede() again the hits I will get will be gr2[3] and >gr[4], instead of getting again gr2[1] for gr1[2] and gr2[2] for >gr1[1]. > > >I was thinking that applying a function that will do what Steve >suggested might do the trick if it can run on one range of gr1 at a >time without modifying gr2 > >something along the lines of: > >two_hits_apply<-function(gr1,gr2) >{ > >p1<-precede(gr1,gr2) > >gr2.less<-gr2[-p1] > >p2<-precede(gr1,gr2.less) > >hits<-c(p1,p2) > >hits > >} > >now all I need is a way to apply a functuon over two GRanegs object. >If I get it right, I will need somehting like mapply(), but that can >actuaaly work on GRanges. > >Is such a function exists, or alternativly, is there a way to do this >with the convential apply functions that I miss? > >Many thanks in advance >Dolev > > > > > > > >On Wed, Aug 21, 2013 at 7:10 PM, Steve Lianoglou ><lianoglou.steve@gene.com<mailto:lianoglou.steve@gene.com><mailto:l ianoglou.steve@gene.com<mailto:lianoglou.steve@gene.com="">>>wrote: > >> Hi, >> >> On Wed, Aug 21, 2013 at 7:25 AM, d r <dolevrahat@gmail.com<mailto :dolevrahat@gmail.com=""><mailto:dolevrahat@gmail.com<mailto:dolevrahat@g mail.com="">>> wrote: >> > Hello >> > >> > I am looking for a way to find the two preceiding/folliwng ranges in one >> > GRanges object to each range in a second GRanges object. >> > >> > In other words, I am looking for a variation on precede() or follow() >> that >> > will return for each range in x two ranges in subject instead of one: the >> > nearest preceidng(following) range and the second nearest preceding >> > (following) range. >> > >> > Is there a way to do this? (sorry for not giving any more constructive >> > suggestions, I know relativily little on how GRanges works any therefore >> > am at a loss as to how to proceed) >> >> A first draft way of doing that would be to simply use the results >> from the first precede to remove those elements from the second call? >> >> For instance, if you have two ranges you are querying, gr1 and gr2: >> >> To get the immediately preceding ranges: >> >> R> p1 <- precede(gr1, gr2) >> >> Then to get the ones immediately after that: >> >> R> gr2.less <- gr2[-p1] >> R> p2 <- precede(gr1, gr2.less) >> >> Then you can see who is who with `gr2[p1]` and the who-who is who with >> `gr2.less[p2]` ... that should get you pretty close -- will likely >> have to handle edge cases, for instance when there are no preceding >> ranges, I think you get an NA (if I recall) so think about what you >> want to do with those. >> >> -steve >> >> -- >> Steve Lianoglou >> Computational Biologist >> Bioinformatics and Computational Biology >> Genentech >> > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor@r-project.org<mailto:bioconductor@r-project.org><mailt o:bioconductor@r-project.org<mailto:bioconductor@r-project.org="">> >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793<tel:%28206%29%20667-2793> [[alternative HTML version deleted]]

ADD REPLY • link 10.7 years ago Malcolm Cook ★ 1.6k

0

Entering edit mode

Dolev, If your requiremnts should change to include a maximum radius (say, 1000) around your query that you will consider subject hits, you might instead use and approach ol.df<-within(as.data.frame(findOverlaps(resize(query,1000,use.names=F ALSE),subj)),d<-distance(query[queryHits],subj[subjectHits])) now, use your favorite R approach to filter ol.df based on your other criteria. Just sayin, Malcolm >-----Original Message----- >From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Cook, Malcolm >Sent: Thursday, August 22, 2013 9:25 AM >To: 'd r'; 'Martin Morgan' >Cc: 'Steve Lianoglou'; 'bioconductor at r-project.org list' >Subject: Re: [BioC] Using precede()/follow() to find two ranges > >Dolev, > >If there are no overlaps within the ranges of your subjects then approaches involving precede and follow will of course not be >effected by overlaps between ranges in your subjects, since, well, there aren't any. > >Still be careful with approaches suing precede/follow alone since any of your 1 nt wide query ranges which are themselves exactly >present in your 1 nt-wide subject ranges won't be found using precede/follow. > >~Malcolm > > >From: d r [mailto:dolevrahat at gmail.com] >Sent: Thursday, August 22, 2013 2:40 AM >To: Martin Morgan >Cc: Cook, Malcolm; Steve Lianoglou; bioconductor at r-project.org list >Subject: Re: [BioC] Using precede()/follow() to find two ranges > >Thanks Martin, this sound like a useful approach. I will try to implement it. >Malcolm - I am not sure that I understood you're comments. >I do not "want explicitly to exclude overlaps with my cgs". Rather, given that my particular problem invloves intersecting single >nuclotide ranges, > I do not think that the problem of overlaps is likely to arise. Am I missing something here? > >Thanks agian for all of your for being so helpful >Dolev > >On Thu, Aug 22, 2013 at 7:57 AM, Martin Morgan <mtmorgan at="" fhcrc.org<mailto:mtmorgan="" at="" fhcrc.org="">> wrote: >On 08/21/2013 01:46 PM, Cook, Malcolm wrote: >Well, there are many ways to skin this kind of cat (apologies to the internet). > >Indeed, there are many cats that are slight variations on this cat. > >re: "since I'm looking for regulatory elements I am not interested in this type of associations anyhow" > >I think that you should probably be handling this else-how. If what you mean is that there should be some minimal distance between >your cgs and your genes, then, you should probably explicitly make this a requirement, and not add it as an afterthought when faced >with the possibilities. > >That said, if you are wed to the exact to the exact results you specified, and you want explicitly to exclude overlaps with your cgs, I >think you will find that you will want to compose something along these lines: > > >1) use precede and follow each twice to generate your two candidate neighbors > >Not sure if this is what Malcolm meant, or if I'm being a little simplistic, but maybe with > > subj = GRanges("A", IRanges(seq(10, 50, 10), width=1), "+") > query = GRanges("A", IRanges(seq(31, 51, 10), width=1), "+") > >The first range that query follows is > > f1 = follow(query, subj) > >and the second is > > f2 = follow(subj[f1], subj) > >giving the indexes into subj as > > > f1 > [1] 3 4 5 > > f2 > [1] 2 3 4 > >I guess there are problems when there is no following range (so need to check for NA's before doing the second follow? > >Another strategy might simply > > sort(subj) > >and then f2 = f1 - 1 (except when seqnames(subj)[f1] != seqnames(subj)[f2]). Probably strand will confusing things! > >Martin > >2) use the range or these candidates and search using findOverlaps within them to see if there are others that were skipped > >Cheers, > >Malcolm > >From: d r [mailto:dolevrahat at gmail.com<mailto:dolevrahat at="" gmail.com="">] >Sent: Wednesday, August 21, 2013 2:57 PM >To: Cook, Malcolm >Cc: Steve Lianoglou; bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> list >Subject: Re: [BioC] Using precede()/follow() to find two ranges > >Hi Malcolm >Thanks for your thoughtfulness. >For the specific problem I'm working on I intend to consider only the TSSs of the genes, which I will compare against illumina 450k >probes, which are also single nucleotide ranges. >Therefore, I do not think that overlapping ranges should be an issue here. Of course, it is theoretically possible that a probe may >overlap a TSS, but since I'm looking for regulatory elements I am not interested in this type of associations anyhow. >So, any insight you may be able to offer on this problem will be highly appreciated. > >However, I think that for my particular problem overlapping ranges should not be an issue. >On Wed, Aug 21, 2013 at 8:59 PM, Cook, Malcolm ><mec at="" stowers.org<mailto:mec="" at="" stowers.org=""><mailto:mec at="" stowers.org<mailto:mec="" at="" stowers.org="">>> wrote: >Dolev, > >Before chiming in on the problem as it is currently framed, I want to make sure of something. > >The documentation for precede and follow read 'Overlapping ranges are excluded.'. > >For instance if one of your cg ranges were to overlap one of your gene ranges, any method based on precede/follow will not discover >this association. > >Is this really desirable for your application? > >-Malcolm > > > >-----Original Message----- > >From: bioconductor-bounces at r-project.org<mailto:bioconductor- bounces="" at="" r-project.org=""><mailto:bioconductor-bounces at="" r-="">project.org<mailto:bioconductor-bounces at="" r-project.org="">> [mailto :bioconductor-bounces at r-project.org<mailto:bioconductor->bounces at r-project.org><mailto:bioconductor-bounces at="" r-project.org<mailto:bioconductor-bounces="" at="" r-project.org="">>] On Behalf Of >d r > >Sent: Wednesday, August 21, 2013 12:15 PM > >To: Steve Lianoglou > >Cc: bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""><mailto:bioconductor at="" r-project.org<mailto:bioconductor="" at="" r-="">project.org>> list > >Subject: Re: [BioC] Using precede()/follow() to find two ranges > > > >Hi Steve and all > > > >Thanks for your suggestion. I was in fact thinking in this direction. > >However, by completly discarding the first set of hits before querying the > >second time I run the risk of missing ranges that may be potential second > >hits beacuse they were discarded after being a first hit for other ranges. > >For examle, if I have these two GRanges: > > > >gr1: > > > >GRanges with 2 ranges and 1 metadata column: > > > > seqnames ranges strand | names > > > > <rle> <iranges> <rle> | <factor> > > > > [1] chr6 [125284212, 125284212] * | cg00991794 > > > > [2] chr6 [150465049, 150465049] * | cg02250071 > > > > > > > > > > > >gr2: > > > >GRanges with 4 ranges and 1 metadata column: > > > > seqnames ranges strand | gene > > > > <rle> <iranges> <rle> | <factor> > > > > [1] chr6 [126284212, 124284212] + | PGM3 > > > > [2] chr6 [160920998, 150920998] + | PLEKHG1 > > > > [3] chr6 [ 83903012, 83903012] + | PGM3 > > > > [4] chr6 [190102159, 170102159] + | WDR27 > > > >The first hits will be gr2[1] and gr2[2], which will both be discarded. > > > >Now if I call precede() again the hits I will get will be gr2[3] and > >gr[4], instead of getting again gr2[1] for gr1[2] and gr2[2] for > >gr1[1]. > > > > > >I was thinking that applying a function that will do what Steve > >suggested might do the trick if it can run on one range of gr1 at a > >time without modifying gr2 > > > >something along the lines of: > > > >two_hits_apply<-function(gr1,gr2) > >{ > > > >p1<-precede(gr1,gr2) > > > >gr2.less<-gr2[-p1] > > > >p2<-precede(gr1,gr2.less) > > > >hits<-c(p1,p2) > > > >hits > > > >} > > > >now all I need is a way to apply a functuon over two GRanegs object. > >If I get it right, I will need somehting like mapply(), but that can > >actuaaly work on GRanges. > > > >Is such a function exists, or alternativly, is there a way to do this > >with the convential apply functions that I miss? > > > >Many thanks in advance > >Dolev > > > > > > > > > > > > > > > >On Wed, Aug 21, 2013 at 7:10 PM, Steve Lianoglou > >><lianoglou.steve at="" gene.com<mailto:lianoglou.steve="" at="" gene.com=""><mailto:lianoglou.steve at="" gene.com<mailto:lianoglou.steve="" at="" gene.="">com>>>wrote: > > > >> Hi, > > >> > >> On Wed, Aug 21, 2013 at 7:25 AM, d r ><dolevrahat at="" gmail.com<mailto:dolevrahat="" at="" gmail.com=""><mailto:dolevrahat at="" gmail.com<mailto:dolevrahat="" at="" gmail.com="">>> wrote: > >> > Hello > >> > > >> > I am looking for a way to find the two preceiding/folliwng ranges in one > >> > GRanges object to each range in a second GRanges object. > >> > > >> > In other words, I am looking for a variation on precede() or follow() > >> that > >> > will return for each range in x two ranges in subject instead of one: the > >> > nearest preceidng(following) range and the second nearest preceding > >> > (following) range. > >> > > >> > Is there a way to do this? (sorry for not giving any more constructive > >> > suggestions, I know relativily little on how GRanges works any therefore > >> > am at a loss as to how to proceed) > >> > >> A first draft way of doing that would be to simply use the results > >> from the first precede to remove those elements from the second call? > >> > >> For instance, if you have two ranges you are querying, gr1 and gr2: > >> > >> To get the immediately preceding ranges: > >> > >> R> p1 <- precede(gr1, gr2) > >> > >> Then to get the ones immediately after that: > >> > >> R> gr2.less <- gr2[-p1] > >> R> p2 <- precede(gr1, gr2.less) > >> > >> Then you can see who is who with `gr2[p1]` and the who-who is who with > >> `gr2.less[p2]` ... that should get you pretty close -- will likely > >> have to handle edge cases, for instance when there are no preceding > >> ranges, I think you get an NA (if I recall) so think about what you > >> want to do with those. > >> > >> -steve > >> > >> -- > >> Steve Lianoglou > >> Computational Biologist > >> Bioinformatics and Computational Biology > >> Genentech > >> > > > > [[alternative HTML version deleted]] > > > >_______________________________________________ > >Bioconductor mailing list > >Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""><mailto:bioconductor at="" r-project.org<mailto:bioconductor="" at="" r-="">project.org>> > > >https://stat.ethz.ch/mailman/listinfo/bioconductor > >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > >-- >Computational Biology / Fred Hutchinson Cancer Research Center >1100 Fairview Ave. N. >PO Box 19024 Seattle, WA 98109 > >Location: Arnold Building M1 B861 >Phone: (206) 667-2793<tel:%28206%29%20667-2793> > > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor at r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 10.7 years ago Malcolm Cook ★ 1.6k

0

Entering edit mode

>-----Original Message----- >From: Martin Morgan [mailto:mtmorgan at fhcrc.org] >Sent: Wednesday, August 21, 2013 11:58 PM >To: Cook, Malcolm >Cc: 'd r'; 'Steve Lianoglou'; 'bioconductor at r-project.org list' >Subject: Re: [BioC] Using precede()/follow() to find two ranges > >On 08/21/2013 01:46 PM, Cook, Malcolm wrote: >> Well, there are many ways to skin this kind of cat (apologies to the internet). >> >> Indeed, there are many cats that are slight variations on this cat. >> >> re: "since I'm looking for regulatory elements I am not interested in this type of associations anyhow" >> >> I think that you should probably be handling this else-how. If what you mean is that there should be some minimal distance >between your cgs and your genes, then, you should probably explicitly make this a requirement, and not add it as an afterthought >when faced with the possibilities. >> >> That said, if you are wed to the exact to the exact results you specified, and you want explicitly to exclude overlaps with your cgs, I >think you will find that you will want to compose something along these lines: >> >> >> 1) use precede and follow each twice to generate your two candidate neighbors > >Not sure if this is what Malcolm meant, or if I'm being a little simplistic, but >maybe with > > subj = GRanges("A", IRanges(seq(10, 50, 10), width=1), "+") > query = GRanges("A", IRanges(seq(31, 51, 10), width=1), "+") > >The first range that query follows is > > f1 = follow(query, subj) > >and the second is > > f2 = follow(subj[f1], subj)' I think this _is_ being a little simplistic. The complexities arise when/if ranges in subj overlap each other. In such cases, only the closest range to the query will be picked up, and the subsequent overlapping range will be skipped. This is the consequence of the fact that 'Overlapping ranges are excluded.' (c.f. ?precede). This approach is essentially step 1 of my two step outline. Step 2 is intended to adjust for the case of overlapping ranges in subj. >giving the indexes into subj as > > > f1 > [1] 3 4 5 > > f2 > [1] 2 3 4 > >I guess there are problems when there is no following range (so need to check >for NA's before doing the second follow? > >Another strategy might simply > > sort(subj) > >and then f2 = f1 - 1 (except when seqnames(subj)[f1] != seqnames(subj)[f2]). ...except also when subj[f1] overlaps subj[f1-1] but I like the insight. >Probably strand will confusing things! Precede and follow optionally take strand into account with ignore.strand. >Martin > >> >> 2) use the range or these candidates and search using findOverlaps within them to see if there are others that were skipped Actually implementing this is tricky. I really suggest you think exactly what your requirements are regarding proximity between query and subject, and overlap between subjects before you go down this route. ~Malcolm >> >> Cheers, >> >> Malcolm >> >> From: d r [mailto:dolevrahat at gmail.com] >> Sent: Wednesday, August 21, 2013 2:57 PM >> To: Cook, Malcolm >> Cc: Steve Lianoglou; bioconductor at r-project.org list >> Subject: Re: [BioC] Using precede()/follow() to find two ranges >> >> Hi Malcolm >> Thanks for your thoughtfulness. >> For the specific problem I'm working on I intend to consider only the TSSs of the genes, which I will compare against illumina 450k >probes, which are also single nucleotide ranges. >> Therefore, I do not think that overlapping ranges should be an issue here. Of course, it is theoretically possible that a probe may >overlap a TSS, but since I'm looking for regulatory elements I am not interested in this type of associations anyhow. >> So, any insight you may be able to offer on this problem will be highly appreciated. >> >> However, I think that for my particular problem overlapping ranges should not be an issue. >> >> On Wed, Aug 21, 2013 at 8:59 PM, Cook, Malcolm <mec at="" stowers.org<mailto:mec="" at="" stowers.org="">> wrote: >> Dolev, >> >> Before chiming in on the problem as it is currently framed, I want to make sure of something. >> >> The documentation for precede and follow read 'Overlapping ranges are excluded.'. >> >> For instance if one of your cg ranges were to overlap one of your gene ranges, any method based on precede/follow will not >discover this association. >> >> Is this really desirable for your application? >> >> -Malcolm >> >> >> >-----Original Message----- >> >From: bioconductor-bounces at r-project.org<mailto :bioconductor-bounces="" at="" r-project.org=""> [mailto:bioconductor-bounces at r- >project.org<mailto:bioconductor-bounces at="" r-project.org="">] On Behalf Of d r >> >Sent: Wednesday, August 21, 2013 12:15 PM >> >To: Steve Lianoglou >> >Cc: bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> list >> >Subject: Re: [BioC] Using precede()/follow() to find two ranges >> > >> >Hi Steve and all >> > >> >Thanks for your suggestion. I was in fact thinking in this direction. >> >However, by completly discarding the first set of hits before querying the >> >second time I run the risk of missing ranges that may be potential second >> >hits beacuse they were discarded after being a first hit for other ranges. >> >For examle, if I have these two GRanges: >> > >> >gr1: >> > >> >GRanges with 2 ranges and 1 metadata column: >> > >> > seqnames ranges strand | names >> > >> > <rle> <iranges> <rle> | <factor> >> > >> > [1] chr6 [125284212, 125284212] * | cg00991794 >> > >> > [2] chr6 [150465049, 150465049] * | cg02250071 >> > >> > >> > >> > >> > >> >gr2: >> > >> >GRanges with 4 ranges and 1 metadata column: >> > >> > seqnames ranges strand | gene >> > >> > <rle> <iranges> <rle> | <factor> >> > >> > [1] chr6 [126284212, 124284212] + | PGM3 >> > >> > [2] chr6 [160920998, 150920998] + | PLEKHG1 >> > >> > [3] chr6 [ 83903012, 83903012] + | PGM3 >> > >> > [4] chr6 [190102159, 170102159] + | WDR27 >> > >> >The first hits will be gr2[1] and gr2[2], which will both be discarded. >> > >> >Now if I call precede() again the hits I will get will be gr2[3] and >> >gr[4], instead of getting again gr2[1] for gr1[2] and gr2[2] for >> >gr1[1]. >> > >> > >> >I was thinking that applying a function that will do what Steve >> >suggested might do the trick if it can run on one range of gr1 at a >> >time without modifying gr2 >> > >> >something along the lines of: >> > >> >two_hits_apply<-function(gr1,gr2) >> >{ >> > >> >p1<-precede(gr1,gr2) >> > >> >gr2.less<-gr2[-p1] >> > >> >p2<-precede(gr1,gr2.less) >> > >> >hits<-c(p1,p2) >> > >> >hits >> > >> >} >> > >> >now all I need is a way to apply a functuon over two GRanegs object. >> >If I get it right, I will need somehting like mapply(), but that can >> >actuaaly work on GRanges. >> > >> >Is such a function exists, or alternativly, is there a way to do this >> >with the convential apply functions that I miss? >> > >> >Many thanks in advance >> >Dolev >> > >> > >> > >> > >> > >> > >> > >> >On Wed, Aug 21, 2013 at 7:10 PM, Steve Lianoglou >> ><lianoglou.steve at="" gene.com<mailto:lianoglou.steve="" at="" gene.com="">>wrote: >> > >> >> Hi, >> >> >> >> On Wed, Aug 21, 2013 at 7:25 AM, d r <dolevrahat at="" gmail.com<mailto:dolevrahat="" at="" gmail.com="">> wrote: >> >> > Hello >> >> > >> >> > I am looking for a way to find the two preceiding/folliwng ranges in one >> >> > GRanges object to each range in a second GRanges object. >> >> > >> >> > In other words, I am looking for a variation on precede() or follow() >> >> that >> >> > will return for each range in x two ranges in subject instead of one: the >> >> > nearest preceidng(following) range and the second nearest preceding >> >> > (following) range. >> >> > >> >> > Is there a way to do this? (sorry for not giving any more constructive >> >> > suggestions, I know relativily little on how GRanges works any therefore >> >> > am at a loss as to how to proceed) >> >> >> >> A first draft way of doing that would be to simply use the results >> >> from the first precede to remove those elements from the second call? >> >> >> >> For instance, if you have two ranges you are querying, gr1 and gr2: >> >> >> >> To get the immediately preceding ranges: >> >> >> >> R> p1 <- precede(gr1, gr2) >> >> >> >> Then to get the ones immediately after that: >> >> >> >> R> gr2.less <- gr2[-p1] >> >> R> p2 <- precede(gr1, gr2.less) >> >> >> >> Then you can see who is who with `gr2[p1]` and the who-who is who with >> >> `gr2.less[p2]` ... that should get you pretty close -- will likely >> >> have to handle edge cases, for instance when there are no preceding >> >> ranges, I think you get an NA (if I recall) so think about what you >> >> want to do with those. >> >> >> >> -steve >> >> >> >> -- >> >> Steve Lianoglou >> >> Computational Biologist >> >> Bioinformatics and Computational Biology >> >> Genentech >> >> >> > >> > [[alternative HTML version deleted]] >> > >> >_______________________________________________ >> >Bioconductor mailing list >> >Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> >> >https://stat.ethz.ch/mailman/listinfo/bioconductor >> >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > >-- >Computational Biology / Fred Hutchinson Cancer Research Center >1100 Fairview Ave. N. >PO Box 19024 Seattle, WA 98109 > >Location: Arnold Building M1 B861 >Phone: (206) 667-2793

ADD REPLY • link 10.7 years ago Malcolm Cook ★ 1.6k

Login before adding your answer.