distances for IRanges

0

Entering edit mode

Kasper Daniel Hansen ★ 6.5k

@kasper-daniel-hansen-2979

Last seen 10 months ago

United States

Assuming I have two IRanges, each with multiple ranges, like ir1 = IRanges(start = 3:6, width = 2) ir2 = IRanges(start = 10:17, width = 2) Is there a fast way to compute a pairwise distance matrix between the two sets, by which I mean ii = 1 jj = 2 width(gaps(c(ir1[ii], ir2[jj]))) where ii, jj would index into a result matrix. Essentially this would be an expanded version of findOverlaps, since any two ranges with distance = 0, have an overlap. Is such functionality available in IRanges, in an efficient implementation (think of the case where the two IRanges have - say - 10,000 ranges or more)? Kasper

IRanges IRanges • 1.3k views

ADD COMMENT • link updated 13.9 years ago by Michael Lawrence ★ 11k • written 13.9 years ago by Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

For all pairwise distances, something simple based on outer() should suffice. It might not be very space efficient, but speed should be somewhat close to optimal. What is the end goal of this? For example, the nearest() function finds nearest neighbors efficiently. You might be able to leverage findOverlaps(). For example, one can set the maximum gap between ranges to be considered overlapping. That could be set to a non-zero value representing some maximum allowable distance. The sparse doublet matrix from as.matrix() would be pretty efficient for distance calculation, via the pgap() function. Michael On Tue, Jun 8, 2010 at 8:51 AM, Kasper Daniel Hansen < kasperdanielhansen@gmail.com> wrote: > Assuming I have two IRanges, each with multiple ranges, like > ir1 = IRanges(start = 3:6, width = 2) > ir2 = IRanges(start = 10:17, width = 2) > > Is there a fast way to compute a pairwise distance matrix between the > two sets, by which I mean > ii = 1 > jj = 2 > width(gaps(c(ir1[ii], ir2[jj]))) > where ii, jj would index into a result matrix. Essentially this would > be an expanded version of findOverlaps, since any two ranges with > distance = 0, have an overlap. > > Is such functionality available in IRanges, in an efficient > implementation (think of the case where the two IRanges have - say - > 10,000 ranges or more)? > > Kasper > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 13.9 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Michael Michael, Thanks for pointing out nearest and friends; I agree that this function should address my question. Reading the man page for nearest function, might I suggest an additional argument like multihits = c("arbitrary", "all") with the intention that a user can get full information in case one range overlaps (or ties in distance) with multiple other ranges. The return value could be a sparse matrix, findOverlaps-like. I find it important to know about multiple hits, especially in the case when a range has multiple overlaps. Thanks, Kasper On Tue, Jun 8, 2010 at 1:25 PM, Michael Lawrence <lawrence.michael at="" gene.com=""> wrote: > For all pairwise distances, something simple based on outer() should > suffice. It might not be very space efficient, but speed should be somewhat > close to optimal. > > What is the end goal of this? For example, the nearest() function finds > nearest neighbors efficiently. > > You might be able to leverage findOverlaps(). For example, one can set the > maximum gap between ranges to be considered overlapping. That could be set > to a non-zero value representing some maximum allowable distance. The sparse > doublet matrix from as.matrix() would be pretty efficient for distance > calculation, via the pgap() function. > > Michael > > On Tue, Jun 8, 2010 at 8:51 AM, Kasper Daniel Hansen > <kasperdanielhansen at="" gmail.com=""> wrote: >> >> Assuming I have two IRanges, each with multiple ranges, like >> ?ir1 = IRanges(start = 3:6, width = 2) >> ?ir2 = IRanges(start = 10:17, width = 2) >> >> Is there a fast way to compute a pairwise distance matrix between the >> two sets, by which I mean >> ?ii = 1 >> ?jj = 2 >> ?width(gaps(c(ir1[ii], ir2[jj]))) >> where ii, jj would index into a result matrix. ?Essentially this would >> be an expanded version of findOverlaps, since any two ranges with >> distance = 0, have an overlap. >> >> Is such functionality available in IRanges, in an efficient >> implementation (think of the case where the two IRanges have - say - >> 10,000 ranges or more)? >> >> Kasper >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD REPLY • link 13.9 years ago Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

On 06/09/2010 02:36 AM, Kasper Daniel Hansen wrote: > Hi Michael > > Michael, > > Thanks for pointing out nearest and friends; I agree that this > function should address my question. > > Reading the man page for nearest function, might I suggest an > additional argument like > multihits = c("arbitrary", "all") yes thanks for the timely nearest hint; for me I was hoping that ties could be decided based on maximum overlap (though ties might still occur, and perhaps I could break ties myself if 'all' were returned without too much difficulty). Martin > with the intention that a user can get full information in case one > range overlaps (or ties in distance) with multiple other ranges. The > return value could be a sparse matrix, findOverlaps-like. I find it > important to know about multiple hits, especially in the case when a > range has multiple overlaps. > > Thanks, > Kasper > > On Tue, Jun 8, 2010 at 1:25 PM, Michael Lawrence > <lawrence.michael at="" gene.com=""> wrote: >> For all pairwise distances, something simple based on outer() should >> suffice. It might not be very space efficient, but speed should be somewhat >> close to optimal. >> >> What is the end goal of this? For example, the nearest() function finds >> nearest neighbors efficiently. >> >> You might be able to leverage findOverlaps(). For example, one can set the >> maximum gap between ranges to be considered overlapping. That could be set >> to a non-zero value representing some maximum allowable distance. The sparse >> doublet matrix from as.matrix() would be pretty efficient for distance >> calculation, via the pgap() function. >> >> Michael >> >> On Tue, Jun 8, 2010 at 8:51 AM, Kasper Daniel Hansen >> <kasperdanielhansen at="" gmail.com=""> wrote: >>> >>> Assuming I have two IRanges, each with multiple ranges, like >>> ir1 = IRanges(start = 3:6, width = 2) >>> ir2 = IRanges(start = 10:17, width = 2) >>> >>> Is there a fast way to compute a pairwise distance matrix between the >>> two sets, by which I mean >>> ii = 1 >>> jj = 2 >>> width(gaps(c(ir1[ii], ir2[jj]))) >>> where ii, jj would index into a result matrix. Essentially this would >>> be an expanded version of findOverlaps, since any two ranges with >>> distance = 0, have an overlap. >>> >>> Is such functionality available in IRanges, in an efficient >>> implementation (think of the case where the two IRanges have - say - >>> 10,000 ranges or more)? >>> >>> Kasper >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 13.9 years ago Martin Morgan 25k

0

Entering edit mode

On Tue, Jun 8, 2010 at 9:53 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: > On 06/09/2010 02:36 AM, Kasper Daniel Hansen wrote: >> Hi Michael >> >> Michael, >> >> Thanks for pointing out nearest and friends; I agree that this >> function should address my question. >> >> Reading the man page for nearest function, might I suggest an >> additional argument like >> ? multihits = c("arbitrary", "all") > > yes thanks for the timely nearest hint; for me I was hoping that ties > could be decided based on maximum overlap (though ties might still > occur, and perhaps I could break ties myself if 'all' were returned > without too much difficulty). I would strongly advocate just returning the multiple hits and let the user decide how to break ties. While I certainly see use cases where Martin's suggestion is reasonable, what happens if a range overlap two different ranges, which have very different width. Should "maximum overlap" be counted in number of integers (in which case you might be biased towards the longer of the two ranges) or should it be in %-length or ? I think allowing for all these possibilities would make for a very complicated interface with little gain. One might conceive of a filter function on top of the return object (which could work for findOverlaps as well), but as the filtering in some cases would depend on additional metadata (say the ranges are of different type and there is an ordering on the type so an overlap with a certain type trumps another type), it might be a lot of infrastructure with very little gain. And really, the filtering would just be an apply over the sparse matrix. Kasper > > Martin > >> with the intention that a user can get full information in case one >> range overlaps (or ties in distance) with multiple other ranges. ?The >> return value could be a sparse matrix, findOverlaps-like. ?I find it >> important to know about multiple hits, especially in the case when a >> range has multiple overlaps. >> >> Thanks, >> Kasper >> >> On Tue, Jun 8, 2010 at 1:25 PM, Michael Lawrence >> <lawrence.michael at="" gene.com=""> wrote: >>> For all pairwise distances, something simple based on outer() should >>> suffice. It might not be very space efficient, but speed should be somewhat >>> close to optimal. >>> >>> What is the end goal of this? For example, the nearest() function finds >>> nearest neighbors efficiently. >>> >>> You might be able to leverage findOverlaps(). For example, one can set the >>> maximum gap between ranges to be considered overlapping. That could be set >>> to a non-zero value representing some maximum allowable distance. The sparse >>> doublet matrix from as.matrix() would be pretty efficient for distance >>> calculation, via the pgap() function. >>> >>> Michael >>> >>> On Tue, Jun 8, 2010 at 8:51 AM, Kasper Daniel Hansen >>> <kasperdanielhansen at="" gmail.com=""> wrote: >>>> >>>> Assuming I have two IRanges, each with multiple ranges, like >>>> ?ir1 = IRanges(start = 3:6, width = 2) >>>> ?ir2 = IRanges(start = 10:17, width = 2) >>>> >>>> Is there a fast way to compute a pairwise distance matrix between the >>>> two sets, by which I mean >>>> ?ii = 1 >>>> ?jj = 2 >>>> ?width(gaps(c(ir1[ii], ir2[jj]))) >>>> where ii, jj would index into a result matrix. ?Essentially this would >>>> be an expanded version of findOverlaps, since any two ranges with >>>> distance = 0, have an overlap. >>>> >>>> Is such functionality available in IRanges, in an efficient >>>> implementation (think of the case where the two IRanges have - say - >>>> 10,000 ranges or more)? >>>> >>>> Kasper >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Martin Morgan > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 >

ADD REPLY • link 13.9 years ago Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

On Tue, Jun 8, 2010 at 8:00 PM, Kasper Daniel Hansen < kasperdanielhansen@gmail.com> wrote: > On Tue, Jun 8, 2010 at 9:53 PM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > > On 06/09/2010 02:36 AM, Kasper Daniel Hansen wrote: > >> Hi Michael > >> > >> Michael, > >> > >> Thanks for pointing out nearest and friends; I agree that this > >> function should address my question. > >> > >> Reading the man page for nearest function, might I suggest an > >> additional argument like > >> multihits = c("arbitrary", "all") > > > > yes thanks for the timely nearest hint; for me I was hoping that ties > > could be decided based on maximum overlap (though ties might still > > occur, and perhaps I could break ties myself if 'all' were returned > > without too much difficulty). > > I would strongly advocate just returning the multiple hits and let the > user decide how to break ties. While I certainly see use cases where > Martin's suggestion is reasonable, what happens if a range overlap two > different ranges, which have very different width. Should "maximum > overlap" be counted in number of integers (in which case you might be > biased towards the longer of the two ranges) or should it be in > %-length or ? I think allowing for all these possibilities would make > for a very complicated interface with little gain. > > One might conceive of a filter function on top of the return object > (which could work for findOverlaps as well), but as the filtering in > some cases would depend on additional metadata (say the ranges are of > different type and there is an ordering on the type so an overlap with > a certain type trumps another type), it might be a lot of > infrastructure with very little gain. And really, the filtering would > just be an apply over the sparse matrix. > > I agree that returning "all" is a good option to have, and it is consistent with the functionality of findOverlaps. If compelling common use cases emerge for tie resolution, the number of multihits options can be expanded. Thanks for the suggestion, Michael > Kasper > > > > > > Martin > > > >> with the intention that a user can get full information in case one > >> range overlaps (or ties in distance) with multiple other ranges. The > >> return value could be a sparse matrix, findOverlaps-like. I find it > >> important to know about multiple hits, especially in the case when a > >> range has multiple overlaps. > >> > >> Thanks, > >> Kasper > >> > >> On Tue, Jun 8, 2010 at 1:25 PM, Michael Lawrence > >> <lawrence.michael@gene.com> wrote: > >>> For all pairwise distances, something simple based on outer() should > >>> suffice. It might not be very space efficient, but speed should be > somewhat > >>> close to optimal. > >>> > >>> What is the end goal of this? For example, the nearest() function finds > >>> nearest neighbors efficiently. > >>> > >>> You might be able to leverage findOverlaps(). For example, one can set > the > >>> maximum gap between ranges to be considered overlapping. That could be > set > >>> to a non-zero value representing some maximum allowable distance. The > sparse > >>> doublet matrix from as.matrix() would be pretty efficient for distance > >>> calculation, via the pgap() function. > >>> > >>> Michael > >>> > >>> On Tue, Jun 8, 2010 at 8:51 AM, Kasper Daniel Hansen > >>> <kasperdanielhansen@gmail.com> wrote: > >>>> > >>>> Assuming I have two IRanges, each with multiple ranges, like > >>>> ir1 = IRanges(start = 3:6, width = 2) > >>>> ir2 = IRanges(start = 10:17, width = 2) > >>>> > >>>> Is there a fast way to compute a pairwise distance matrix between the > >>>> two sets, by which I mean > >>>> ii = 1 > >>>> jj = 2 > >>>> width(gaps(c(ir1[ii], ir2[jj]))) > >>>> where ii, jj would index into a result matrix. Essentially this would > >>>> be an expanded version of findOverlaps, since any two ranges with > >>>> distance = 0, have an overlap. > >>>> > >>>> Is such functionality available in IRanges, in an efficient > >>>> implementation (think of the case where the two IRanges have - say - > >>>> 10,000 ranges or more)? > >>>> > >>>> Kasper > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor@stat.math.ethz.ch > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: > >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>> > >>> > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@stat.math.ethz.ch > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > -- > > Martin Morgan > > Computational Biology / Fred Hutchinson Cancer Research Center > > 1100 Fairview Ave. N. > > PO Box 19024 Seattle, WA 98109 > > > > Location: Arnold Building M1 B861 > > Phone: (206) 667-2793 > > > [[alternative HTML version deleted]]

ADD REPLY • link 13.9 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Sorry it took so long but I just checked this in for version 1.7.8 On Tue, Jun 8, 2010 at 8:27 PM, Michael Lawrence <michafla@gene.com> wrote: > > > On Tue, Jun 8, 2010 at 8:00 PM, Kasper Daniel Hansen < > kasperdanielhansen@gmail.com> wrote: > >> On Tue, Jun 8, 2010 at 9:53 PM, Martin Morgan <mtmorgan@fhcrc.org> wrote: >> > On 06/09/2010 02:36 AM, Kasper Daniel Hansen wrote: >> >> Hi Michael >> >> >> >> Michael, >> >> >> >> Thanks for pointing out nearest and friends; I agree that this >> >> function should address my question. >> >> >> >> Reading the man page for nearest function, might I suggest an >> >> additional argument like >> >> multihits = c("arbitrary", "all") >> > >> > yes thanks for the timely nearest hint; for me I was hoping that ties >> > could be decided based on maximum overlap (though ties might still >> > occur, and perhaps I could break ties myself if 'all' were returned >> > without too much difficulty). >> >> I would strongly advocate just returning the multiple hits and let the >> user decide how to break ties. While I certainly see use cases where >> Martin's suggestion is reasonable, what happens if a range overlap two >> different ranges, which have very different width. Should "maximum >> overlap" be counted in number of integers (in which case you might be >> biased towards the longer of the two ranges) or should it be in >> %-length or ? I think allowing for all these possibilities would make >> for a very complicated interface with little gain. >> >> One might conceive of a filter function on top of the return object >> (which could work for findOverlaps as well), but as the filtering in >> some cases would depend on additional metadata (say the ranges are of >> different type and there is an ordering on the type so an overlap with >> a certain type trumps another type), it might be a lot of >> infrastructure with very little gain. And really, the filtering would >> just be an apply over the sparse matrix. >> >> > I agree that returning "all" is a good option to have, and it is consistent > with the functionality of findOverlaps. If compelling common use cases > emerge for tie resolution, the number of multihits options can be expanded. > > Thanks for the suggestion, > Michael > > >> Kasper >> >> >> > >> > Martin >> > >> >> with the intention that a user can get full information in case one >> >> range overlaps (or ties in distance) with multiple other ranges. The >> >> return value could be a sparse matrix, findOverlaps-like. I find it >> >> important to know about multiple hits, especially in the case when a >> >> range has multiple overlaps. >> >> >> >> Thanks, >> >> Kasper >> >> >> >> On Tue, Jun 8, 2010 at 1:25 PM, Michael Lawrence >> >> <lawrence.michael@gene.com> wrote: >> >>> For all pairwise distances, something simple based on outer() should >> >>> suffice. It might not be very space efficient, but speed should be >> somewhat >> >>> close to optimal. >> >>> >> >>> What is the end goal of this? For example, the nearest() function >> finds >> >>> nearest neighbors efficiently. >> >>> >> >>> You might be able to leverage findOverlaps(). For example, one can set >> the >> >>> maximum gap between ranges to be considered overlapping. That could be >> set >> >>> to a non-zero value representing some maximum allowable distance. The >> sparse >> >>> doublet matrix from as.matrix() would be pretty efficient for distance >> >>> calculation, via the pgap() function. >> >>> >> >>> Michael >> >>> >> >>> On Tue, Jun 8, 2010 at 8:51 AM, Kasper Daniel Hansen >> >>> <kasperdanielhansen@gmail.com> wrote: >> >>>> >> >>>> Assuming I have two IRanges, each with multiple ranges, like >> >>>> ir1 = IRanges(start = 3:6, width = 2) >> >>>> ir2 = IRanges(start = 10:17, width = 2) >> >>>> >> >>>> Is there a fast way to compute a pairwise distance matrix between the >> >>>> two sets, by which I mean >> >>>> ii = 1 >> >>>> jj = 2 >> >>>> width(gaps(c(ir1[ii], ir2[jj]))) >> >>>> where ii, jj would index into a result matrix. Essentially this >> would >> >>>> be an expanded version of findOverlaps, since any two ranges with >> >>>> distance = 0, have an overlap. >> >>>> >> >>>> Is such functionality available in IRanges, in an efficient >> >>>> implementation (think of the case where the two IRanges have - say - >> >>>> 10,000 ranges or more)? >> >>>> >> >>>> Kasper >> >>>> >> >>>> _______________________________________________ >> >>>> Bioconductor mailing list >> >>>> Bioconductor@stat.math.ethz.ch >> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >>>> Search the archives: >> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >>> >> >>> >> >> >> >> _______________________________________________ >> >> Bioconductor mailing list >> >> Bioconductor@stat.math.ethz.ch >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> > >> > -- >> > Martin Morgan >> > Computational Biology / Fred Hutchinson Cancer Research Center >> > 1100 Fairview Ave. N. >> > PO Box 19024 Seattle, WA 98109 >> > >> > Location: Arnold Building M1 B861 >> > Phone: (206) 667-2793 >> > >> > > [[alternative HTML version deleted]]

ADD REPLY • link 13.9 years ago Michael Lawrence ★ 11k

Login before adding your answer.