GenomicRanges::nearest (especially follow)
1
0
Entering edit mode
@harris-a-jaffee-3972
Last seen 10.1 years ago
United States
Apologies in advance if I just don't understand the ignore.strand switch, or perhaps GRanges objects. Also, I have not even tried to understand GenomicRanges:::.GenomicRanges_findPrecedeFollow I have the impression that a query with strand entirely "*" essentially implies ignore.strand, but this case here is handled correctly only if ignore.strand is TRUE (consistent with distanceToNearest but not distance, which seems correct): > x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", strand="*") > Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), seqnames="chr1", strand=c("+","-")) > distance(x, Y[1]) [1] 1 > distance(x, Y[2]) [1] 2 > nearest(x, Y, ignore.strand=TRUE) # correct [1] 1 > nearest(ranges(x), ranges(Y)) # also correct [1] 1 However, > nearest(x, Y) [1] 2 > distanceToNearest(x, Y) DataFrame with 1 row and 3 columns queryHits subjectHits distance <integer> <integer> <integer> 1 1 2 2 > distanceToNearest(x, Y, ignore.strand=TRUE) DataFrame with 1 row and 3 columns queryHits subjectHits distance <integer> <integer> <integer> 1 1 1 1 Finally, > follow(ranges(x), ranges(Y)) [1] NA The issue (along with GenomicRanges:::.nearest) must come down to this: > follow(x, Y) # how can this be? [1] 2 whereas > follow(x, Y, ignore.strand=TRUE) # correct, I think [1] NA > sessionInfo() R Under development (unstable) (2012-10-10 r60908) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices datasets utils methods base other attached packages: [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 loaded via a namespace (and not attached): [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0
• 3.4k views
ADD COMMENT
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.9 years ago
United States
Hi Harris, On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: > Apologies in advance if I just don't understand the ignore.strand switch, or > perhaps GRanges objects. Also, I have not even tried to understand > > GenomicRanges:::.GenomicRanges_findPrecedeFollow > > I have the impression that a query with strand entirely "*" essentially implies > ignore.strand, but this case here is handled correctly only if ignore.strand is > TRUE (consistent with distanceToNearest but not distance, which seems correct): > >> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", strand="*") >> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), seqnames="chr1", strand=c("+","-")) >> distance(x, Y[1]) > [1] 1 >> distance(x, Y[2]) > [1] 2 > >> nearest(x, Y, ignore.strand=TRUE) # correct > [1] 1 >> nearest(ranges(x), ranges(Y)) # also correct > [1] 1 > > However, > >> nearest(x, Y) > [1] 2 Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and 1.10.2. The problem was in .nearest() where I computed the preceding and following distance of the ranges. I was missing abs() so the distance for the second range was -2 instead of 2. When checking to see which range was closest > 1 < -2 [1] FALSE so the second range won. >> distanceToNearest(x, Y) > DataFrame with 1 row and 3 columns > queryHits subjectHits distance > <integer> <integer> <integer> > 1 1 2 2 > >> distanceToNearest(x, Y, ignore.strand=TRUE) > DataFrame with 1 row and 3 columns > queryHits subjectHits distance > <integer> <integer> <integer> > 1 1 1 1 > > > Finally, > >> follow(ranges(x), ranges(Y)) > [1] NA > > The issue (along with GenomicRanges:::.nearest) must come down to this: > >> follow(x, Y) # how can this be? > [1] 2 This behavior is due to the fact that a '*' range can compare itself to either '+' or '-'. The 'x' is '*" located at position 5. When comparing it to the '+' Y[1] we think of both ranges as '+'. precede() and follow() locations are determined by moving from 3' to 5'. On '+' strand this is from left to right so 5 precedes 6. > precede(x, Y[1]) [1] 1 but does not follow it > follow(x, Y[1]) [1] NA When comparing the '*' to Y[2] we think of both ranges as '-'. On the '-' moving from 3' to 5' is right to left so the 5 follows 7 > follow(x, Y[2]) [1] 1 but does not precede it > precede(x, Y[2]) [1] NA When we set ignore.strand=TRUE, all ranges are considered '+'. Valerie > whereas > >> follow(x, Y, ignore.strand=TRUE) # correct, I think > [1] NA > >> sessionInfo() > R Under development (unstable) (2012-10-10 r60908) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C > [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 > [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices datasets utils methods base > > other attached packages: > [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 > > loaded via a namespace (and not attached): > [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
A few follow up items from an off-list conversation with Harris. 1) I reversed the use of 5' and3' in my response below. Sorry about that. 2) The primary landing page for ?precede is in IRanges. I've expanded the \seealso section of this page to point the user to the man page in GenomicRanges. 3) The precede/follow man page in GenomicRanges now specifically states that 5' to 3' is the relevant orientation for precede/follow. For those interested in taking a look, the man page has several aliases, ?'nearest-methods' ?'precede,GenomicRanges,GenomicRanges-method' ?'follow,GenomicRanges,GenomicRanges-method' Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4. Thanks Harris. Valerie On 10/12/2012 04:17 PM, Valerie Obenchain wrote: > Hi Harris, > > On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: >> Apologies in advance if I just don't understand the ignore.strand >> switch, or >> perhaps GRanges objects. Also, I have not even tried to understand >> >> GenomicRanges:::.GenomicRanges_findPrecedeFollow >> >> I have the impression that a query with strand entirely "*" >> essentially implies >> ignore.strand, but this case here is handled correctly only if >> ignore.strand is >> TRUE (consistent with distanceToNearest but not distance, which seems >> correct): >> >>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", >>> strand="*") >>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), >>> seqnames="chr1", strand=c("+","-")) >>> distance(x, Y[1]) >> [1] 1 >>> distance(x, Y[2]) >> [1] 2 >> >>> nearest(x, Y, ignore.strand=TRUE) # correct >> [1] 1 >>> nearest(ranges(x), ranges(Y)) # also correct >> [1] 1 >> >> However, >> >>> nearest(x, Y) >> [1] 2 > > Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and > 1.10.2. > > The problem was in .nearest() where I computed the preceding and > following distance of the ranges. I was missing abs() so the distance > for the second range was -2 instead of 2. When checking to see which > range was closest > > > 1 < -2 > [1] FALSE > > so the second range won. >>> distanceToNearest(x, Y) >> DataFrame with 1 row and 3 columns >> queryHits subjectHits distance >> <integer> <integer> <integer> >> 1 1 2 2 >> >>> distanceToNearest(x, Y, ignore.strand=TRUE) >> DataFrame with 1 row and 3 columns >> queryHits subjectHits distance >> <integer> <integer> <integer> >> 1 1 1 1 >> >> >> Finally, >> >>> follow(ranges(x), ranges(Y)) >> [1] NA >> >> The issue (along with GenomicRanges:::.nearest) must come down to this: >> >>> follow(x, Y) # how can this be? >> [1] 2 > > This behavior is due to the fact that a '*' range can compare itself > to either '+' or '-'. The 'x' is '*" located at position 5. When > comparing it to the '+' Y[1] > we think of both ranges as '+'. precede() and follow() locations are > determined by moving from 3' to 5'. On '+' strand this is from left to > right so 5 precedes 6. > > > precede(x, Y[1]) > [1] 1 > > but does not follow it > > > follow(x, Y[1]) > [1] NA > > When comparing the '*' to Y[2] we think of both ranges as '-'. On the > '-' moving from 3' to 5' is right to left so the 5 follows 7 > > > follow(x, Y[2]) > [1] 1 > > but does not precede it > > > precede(x, Y[2]) > [1] NA > > When we set ignore.strand=TRUE, all ranges are considered '+'. > > Valerie >> whereas >> >>> follow(x, Y, ignore.strand=TRUE) # correct, I think >> [1] NA >> >>> sessionInfo() >> R Under development (unstable) (2012-10-10 r60908) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C >> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 >> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices datasets utils methods base >> >> other attached packages: >> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 >> >> loaded via a namespace (and not attached): >> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Fantastic, thank you. I'm not sure I knew about that page! For example, ?precede gives me the IRanges nearest-methods page, and only if I explicitly request ?'nearest-methods' can I choose the GenomicRanges page. That brings up another issue: ?distance again sends me to the IRanges nearest-methods help, and that only documents distanceToNearest, which is not enough. The spec for distance actually lives only on ?'precede,GenomicRanges,GenomicRanges-method', as far as I see. Finally, and admittedly a total edge case, but zero-width ranges appear to be mishandled, or not handled, by the IRanges distance code. I think they should be infinitely far from anything, or at least NA, and there should not be any nearest or distanceToNearest value for a zero-width query. > x IRanges of length 1 start end width [1] 2 1 0 > distance(x, x) [1] 1 > y IRanges of length 1 start end width [1] 3 2 0 > distance(x, y) [1] 2 > z IRanges of length 1 start end width [1] 3 3 1 > distance(x, z) [1] 2 On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote: > A few follow up items from an off-list conversation with Harris. > > 1) I reversed the use of 5' and3' in my response below. Sorry about that. > > 2) The primary landing page for ?precede is in IRanges. I've expanded the \seealso section of this page to point the user to the man page in GenomicRanges. > > 3) The precede/follow man page in GenomicRanges now specifically states that 5' to 3' is the relevant orientation for precede/follow. For those interested in taking a look, the man page has several aliases, > > ?'nearest-methods' > ?'precede,GenomicRanges,GenomicRanges-method' > ?'follow,GenomicRanges,GenomicRanges-method' > > Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4. > > Thanks Harris. > > Valerie > > On 10/12/2012 04:17 PM, Valerie Obenchain wrote: >> Hi Harris, >> >> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: >>> Apologies in advance if I just don't understand the ignore.strand switch, or >>> perhaps GRanges objects. Also, I have not even tried to understand >>> >>> GenomicRanges:::.GenomicRanges_findPrecedeFollow >>> >>> I have the impression that a query with strand entirely "*" essentially implies >>> ignore.strand, but this case here is handled correctly only if ignore.strand is >>> TRUE (consistent with distanceToNearest but not distance, which seems correct): >>> >>>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", strand="*") >>>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), seqnames="chr1", strand=c("+","-")) >>>> distance(x, Y[1]) >>> [1] 1 >>>> distance(x, Y[2]) >>> [1] 2 >>> >>>> nearest(x, Y, ignore.strand=TRUE) # correct >>> [1] 1 >>>> nearest(ranges(x), ranges(Y)) # also correct >>> [1] 1 >>> >>> However, >>> >>>> nearest(x, Y) >>> [1] 2 >> >> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and 1.10.2. >> >> The problem was in .nearest() where I computed the preceding and following distance of the ranges. I was missing abs() so the distance for the second range was -2 instead of 2. When checking to see which range was closest >> >> > 1 < -2 >> [1] FALSE >> >> so the second range won. >>>> distanceToNearest(x, Y) >>> DataFrame with 1 row and 3 columns >>> queryHits subjectHits distance >>> <integer> <integer> <integer> >>> 1 1 2 2 >>> >>>> distanceToNearest(x, Y, ignore.strand=TRUE) >>> DataFrame with 1 row and 3 columns >>> queryHits subjectHits distance >>> <integer> <integer> <integer> >>> 1 1 1 1 >>> >>> >>> Finally, >>> >>>> follow(ranges(x), ranges(Y)) >>> [1] NA >>> >>> The issue (along with GenomicRanges:::.nearest) must come down to this: >>> >>>> follow(x, Y) # how can this be? >>> [1] 2 >> >> This behavior is due to the fact that a '*' range can compare itself to either '+' or '-'. The 'x' is '*" located at position 5. When comparing it to the '+' Y[1] >> we think of both ranges as '+'. precede() and follow() locations are determined by moving from 3' to 5'. On '+' strand this is from left to right so 5 precedes 6. >> >> > precede(x, Y[1]) >> [1] 1 >> >> but does not follow it >> >> > follow(x, Y[1]) >> [1] NA >> >> When comparing the '*' to Y[2] we think of both ranges as '-'. On the '-' moving from 3' to 5' is right to left so the 5 follows 7 >> >> > follow(x, Y[2]) >> [1] 1 >> >> but does not precede it >> >> > precede(x, Y[2]) >> [1] NA >> >> When we set ignore.strand=TRUE, all ranges are considered '+'. >> >> Valerie >>> whereas >>> >>>> follow(x, Y, ignore.strand=TRUE) # correct, I think >>> [1] NA >>> >>>> sessionInfo() >>> R Under development (unstable) (2012-10-10 r60908) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C >>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 >>> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices datasets utils methods base >>> >>> other attached packages: >>> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 >>> >>> loaded via a namespace (and not attached): >>> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY
0
Entering edit mode
On 10/16/2012 09:25 AM, Harris A. Jaffee wrote: > Fantastic, thank you. I'm not sure I knew about that page! > > For example, ?precede gives me the IRanges nearest-methods page, > and only if I explicitly request ?'nearest-methods' can I choose > the GenomicRanges page. That brings up another issue: ?distance > again sends me to the IRanges nearest-methods help, and that only > documents distanceToNearest, which is not enough. The spec for > distance actually lives only on > ?'precede,GenomicRanges,GenomicRanges-method', as far as I see. I can update the IRanges man page to include distance(). > Finally, and admittedly a total edge case, but zero-width ranges > appear to be mishandled, or not handled, by the IRanges distance > code. I think they should be infinitely far from anything, or at > least NA, and there should not be any nearest or distanceToNearest > value for a zero-width query. cc'ing Herve and Michael for history. In general I agree that zero-width ranges should not have a distance from any other range. I favor returning NA instead of inf since the range (albeit zero-width) does have a location. One concern I have is that we use zero-width ranges to represent insertions. In that context we may want to compute the distance to the nearest insertion. I do not have a concrete use case - just thinking of what might come. I'd be interested in opinions on this. Valerie >> x > IRanges of length 1 > start end width > [1] 2 1 0 > >> distance(x, x) > [1] 1 > >> y > IRanges of length 1 > start end width > [1] 3 2 0 > >> distance(x, y) > [1] 2 > >> z > IRanges of length 1 > start end width > [1] 3 3 1 > >> distance(x, z) > [1] 2 > > On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote: > >> A few follow up items from an off-list conversation with Harris. >> >> 1) I reversed the use of 5' and3' in my response below. Sorry about that. >> >> 2) The primary landing page for ?precede is in IRanges. I've expanded the \seealso section of this page to point the user to the man page in GenomicRanges. >> >> 3) The precede/follow man page in GenomicRanges now specifically states that 5' to 3' is the relevant orientation for precede/follow. For those interested in taking a look, the man page has several aliases, >> >> ?'nearest-methods' >> ?'precede,GenomicRanges,GenomicRanges-method' >> ?'follow,GenomicRanges,GenomicRanges-method' >> >> Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4. >> >> Thanks Harris. >> >> Valerie >> >> On 10/12/2012 04:17 PM, Valerie Obenchain wrote: >>> Hi Harris, >>> >>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: >>>> Apologies in advance if I just don't understand the ignore.strand switch, or >>>> perhaps GRanges objects. Also, I have not even tried to understand >>>> >>>> GenomicRanges:::.GenomicRanges_findPrecedeFollow >>>> >>>> I have the impression that a query with strand entirely "*" essentially implies >>>> ignore.strand, but this case here is handled correctly only if ignore.strand is >>>> TRUE (consistent with distanceToNearest but not distance, which seems correct): >>>> >>>>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", strand="*") >>>>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), seqnames="chr1", strand=c("+","-")) >>>>> distance(x, Y[1]) >>>> [1] 1 >>>>> distance(x, Y[2]) >>>> [1] 2 >>>> >>>>> nearest(x, Y, ignore.strand=TRUE) # correct >>>> [1] 1 >>>>> nearest(ranges(x), ranges(Y)) # also correct >>>> [1] 1 >>>> >>>> However, >>>> >>>>> nearest(x, Y) >>>> [1] 2 >>> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and 1.10.2. >>> >>> The problem was in .nearest() where I computed the preceding and following distance of the ranges. I was missing abs() so the distance for the second range was -2 instead of 2. When checking to see which range was closest >>> >>>> 1< -2 >>> [1] FALSE >>> >>> so the second range won. >>>>> distanceToNearest(x, Y) >>>> DataFrame with 1 row and 3 columns >>>> queryHits subjectHits distance >>>> <integer> <integer> <integer> >>>> 1 1 2 2 >>>> >>>>> distanceToNearest(x, Y, ignore.strand=TRUE) >>>> DataFrame with 1 row and 3 columns >>>> queryHits subjectHits distance >>>> <integer> <integer> <integer> >>>> 1 1 1 1 >>>> >>>> >>>> Finally, >>>> >>>>> follow(ranges(x), ranges(Y)) >>>> [1] NA >>>> >>>> The issue (along with GenomicRanges:::.nearest) must come down to this: >>>> >>>>> follow(x, Y) # how can this be? >>>> [1] 2 >>> This behavior is due to the fact that a '*' range can compare itself to either '+' or '-'. The 'x' is '*" located at position 5. When comparing it to the '+' Y[1] >>> we think of both ranges as '+'. precede() and follow() locations are determined by moving from 3' to 5'. On '+' strand this is from left to right so 5 precedes 6. >>> >>>> precede(x, Y[1]) >>> [1] 1 >>> >>> but does not follow it >>> >>>> follow(x, Y[1]) >>> [1] NA >>> >>> When comparing the '*' to Y[2] we think of both ranges as '-'. On the '-' moving from 3' to 5' is right to left so the 5 follows 7 >>> >>>> follow(x, Y[2]) >>> [1] 1 >>> >>> but does not precede it >>> >>>> precede(x, Y[2]) >>> [1] NA >>> >>> When we set ignore.strand=TRUE, all ranges are considered '+'. >>> >>> Valerie >>>> whereas >>>> >>>>> follow(x, Y, ignore.strand=TRUE) # correct, I think >>>> [1] NA >>>> >>>>> sessionInfo() >>>> R Under development (unstable) (2012-10-10 r60908) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C >>>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 >>>> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 >>>> [7] LC_PAPER=C LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices datasets utils methods base >>>> >>>> other attached packages: >>>> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Since these zero-width ranges do have a position, I'm not sure why we cannot calculate their distance. As long as we have an established rule about how we handle them. For example, is it just from the start? That's probably most intuitive. Michael On Tue, Oct 16, 2012 at 11:25 AM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > On 10/16/2012 09:25 AM, Harris A. Jaffee wrote: > >> Fantastic, thank you. I'm not sure I knew about that page! >> >> For example, ?precede gives me the IRanges nearest-methods page, >> and only if I explicitly request ?'nearest-methods' can I choose >> the GenomicRanges page. That brings up another issue: ?distance >> again sends me to the IRanges nearest-methods help, and that only >> documents distanceToNearest, which is not enough. The spec for >> distance actually lives only on >> ?'precede,GenomicRanges,**GenomicRanges-method', as far as I see. >> > > I can update the IRanges man page to include distance(). > > Finally, and admittedly a total edge case, but zero-width ranges >> appear to be mishandled, or not handled, by the IRanges distance >> code. I think they should be infinitely far from anything, or at >> least NA, and there should not be any nearest or distanceToNearest >> value for a zero-width query. >> > cc'ing Herve and Michael for history. > > In general I agree that zero-width ranges should not have a distance from > any other range. I favor returning NA instead of inf since the range > (albeit zero-width) does have a location. > > One concern I have is that we use zero-width ranges to represent > insertions. In that context we may want to compute the distance to the > nearest insertion. I do not have a concrete use case - just thinking of > what might come. I'd be interested in opinions on this. > > Valerie > > > > x >>> >> IRanges of length 1 >> start end width >> [1] 2 1 0 >> >> distance(x, x) >>> >> [1] 1 >> >> y >>> >> IRanges of length 1 >> start end width >> [1] 3 2 0 >> >> distance(x, y) >>> >> [1] 2 >> >> z >>> >> IRanges of length 1 >> start end width >> [1] 3 3 1 >> >> distance(x, z) >>> >> [1] 2 >> >> On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote: >> >> A few follow up items from an off-list conversation with Harris. >>> >>> 1) I reversed the use of 5' and3' in my response below. Sorry about that. >>> >>> 2) The primary landing page for ?precede is in IRanges. I've expanded >>> the \seealso section of this page to point the user to the man page in >>> GenomicRanges. >>> >>> 3) The precede/follow man page in GenomicRanges now specifically states >>> that 5' to 3' is the relevant orientation for precede/follow. For those >>> interested in taking a look, the man page has several aliases, >>> >>> ?'nearest-methods' >>> ?'precede,GenomicRanges,**GenomicRanges-method' >>> ?'follow,GenomicRanges,**GenomicRanges-method' >>> >>> Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4. >>> >>> Thanks Harris. >>> >>> Valerie >>> >>> On 10/12/2012 04:17 PM, Valerie Obenchain wrote: >>> >>>> Hi Harris, >>>> >>>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: >>>> >>>>> Apologies in advance if I just don't understand the ignore.strand >>>>> switch, or >>>>> perhaps GRanges objects. Also, I have not even tried to understand >>>>> >>>>> GenomicRanges:::.**GenomicRanges_**findPrecedeFollow >>>>> >>>>> I have the impression that a query with strand entirely "*" >>>>> essentially implies >>>>> ignore.strand, but this case here is handled correctly only if >>>>> ignore.strand is >>>>> TRUE (consistent with distanceToNearest but not distance, which seems >>>>> correct): >>>>> >>>>> x = GRanges(ranges=IRanges(start=**5, end=5), seqnames="chr1", >>>>>> strand="*") >>>>>> Y = GRanges(ranges=IRanges(start=**c(6,7), end=c(6,7)), >>>>>> seqnames="chr1", strand=c("+","-")) >>>>>> distance(x, Y[1]) >>>>>> >>>>> [1] 1 >>>>> >>>>>> distance(x, Y[2]) >>>>>> >>>>> [1] 2 >>>>> >>>>> nearest(x, Y, ignore.strand=TRUE) # correct >>>>>> >>>>> [1] 1 >>>>> >>>>>> nearest(ranges(x), ranges(Y)) # also correct >>>>>> >>>>> [1] 1 >>>>> >>>>> However, >>>>> >>>>> nearest(x, Y) >>>>>> >>>>> [1] 2 >>>>> >>>> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and >>>> 1.10.2. >>>> >>>> The problem was in .nearest() where I computed the preceding and >>>> following distance of the ranges. I was missing abs() so the distance for >>>> the second range was -2 instead of 2. When checking to see which range was >>>> closest >>>> >>>> 1< -2 >>>>> >>>> [1] FALSE >>>> >>>> so the second range won. >>>> >>>>> distanceToNearest(x, Y) >>>>>> >>>>> DataFrame with 1 row and 3 columns >>>>> queryHits subjectHits distance >>>>> <integer> <integer> <integer> >>>>> 1 1 2 2 >>>>> >>>>> distanceToNearest(x, Y, ignore.strand=TRUE) >>>>>> >>>>> DataFrame with 1 row and 3 columns >>>>> queryHits subjectHits distance >>>>> <integer> <integer> <integer> >>>>> 1 1 1 1 >>>>> >>>>> >>>>> Finally, >>>>> >>>>> follow(ranges(x), ranges(Y)) >>>>>> >>>>> [1] NA >>>>> >>>>> The issue (along with GenomicRanges:::.nearest) must come down to this: >>>>> >>>>> follow(x, Y) # how can this be? >>>>>> >>>>> [1] 2 >>>>> >>>> This behavior is due to the fact that a '*' range can compare itself to >>>> either '+' or '-'. The 'x' is '*" located at position 5. When comparing >>>> it to the '+' Y[1] >>>> we think of both ranges as '+'. precede() and follow() locations are >>>> determined by moving from 3' to 5'. On '+' strand this is from left to >>>> right so 5 precedes 6. >>>> >>>> precede(x, Y[1]) >>>>> >>>> [1] 1 >>>> >>>> but does not follow it >>>> >>>> follow(x, Y[1]) >>>>> >>>> [1] NA >>>> >>>> When comparing the '*' to Y[2] we think of both ranges as '-'. On the >>>> '-' moving from 3' to 5' is right to left so the 5 follows 7 >>>> >>>> follow(x, Y[2]) >>>>> >>>> [1] 1 >>>> >>>> but does not precede it >>>> >>>> precede(x, Y[2]) >>>>> >>>> [1] NA >>>> >>>> When we set ignore.strand=TRUE, all ranges are considered '+'. >>>> >>>> Valerie >>>> >>>>> whereas >>>>> >>>>> follow(x, Y, ignore.strand=TRUE) # correct, I think >>>>>> >>>>> [1] NA >>>>> >>>>> sessionInfo() >>>>>> >>>>> R Under development (unstable) (2012-10-10 r60908) >>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>> >>>>> locale: >>>>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C >>>>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 >>>>> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 >>>>> [7] LC_PAPER=C LC_NAME=C >>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C >>>>> >>>>> attached base packages: >>>>> [1] stats graphics grDevices datasets utils methods base >>>>> >>>>> other attached packages: >>>>> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 >>>>> >>>>> ______________________________**_________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@r-project.org >>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: sta="" t.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>> Search the archives: http://news.gmane.org/gmane.** >>>>> science.biology.informatics.**conductor<http: news.gmane.org="" gm="" ane.science.biology.informatics.conductor=""> >>>>> >>>> ______________________________**_________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org >>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat="" .ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>> Search the archives: http://news.gmane.org/gmane.** >>>> science.biology.informatics.**conductor<http: news.gmane.org="" gma="" ne.science.biology.informatics.conductor=""> >>>> >>> > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Any rule is fine with me. They would not occur for us except by mistake. I was basically raising an alert that the current situation is not quite complete. On Oct 16, 2012, at 2:53 PM, Michael Lawrence wrote: > Since these zero-width ranges do have a position, I'm not sure why we cannot calculate their distance. As long as we have an established rule about how we handle them. For example, is it just from the start? That's probably most intuitive. > > Michael > > On Tue, Oct 16, 2012 at 11:25 AM, Valerie Obenchain <vobencha at="" fhcrc.org=""> wrote: > On 10/16/2012 09:25 AM, Harris A. Jaffee wrote: > Fantastic, thank you. I'm not sure I knew about that page! > > For example, ?precede gives me the IRanges nearest-methods page, > and only if I explicitly request ?'nearest-methods' can I choose > the GenomicRanges page. That brings up another issue: ?distance > again sends me to the IRanges nearest-methods help, and that only > documents distanceToNearest, which is not enough. The spec for > distance actually lives only on > ?'precede,GenomicRanges,GenomicRanges-method', as far as I see. > > I can update the IRanges man page to include distance(). > > Finally, and admittedly a total edge case, but zero-width ranges > appear to be mishandled, or not handled, by the IRanges distance > code. I think they should be infinitely far from anything, or at > least NA, and there should not be any nearest or distanceToNearest > value for a zero-width query. > cc'ing Herve and Michael for history. > > In general I agree that zero-width ranges should not have a distance from any other range. I favor returning NA instead of inf since the range (albeit zero-width) does have a location. > > One concern I have is that we use zero-width ranges to represent insertions. In that context we may want to compute the distance to the nearest insertion. I do not have a concrete use case - just thinking of what might come. I'd be interested in opinions on this. > > Valerie > > > > x > IRanges of length 1 > start end width > [1] 2 1 0 > > distance(x, x) > [1] 1 > > y > IRanges of length 1 > start end width > [1] 3 2 0 > > distance(x, y) > [1] 2 > > z > IRanges of length 1 > start end width > [1] 3 3 1 > > distance(x, z) > [1] 2 > > On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote: > > A few follow up items from an off-list conversation with Harris. > > 1) I reversed the use of 5' and3' in my response below. Sorry about that. > > 2) The primary landing page for ?precede is in IRanges. I've expanded the \seealso section of this page to point the user to the man page in GenomicRanges. > > 3) The precede/follow man page in GenomicRanges now specifically states that 5' to 3' is the relevant orientation for precede/follow. For those interested in taking a look, the man page has several aliases, > > ?'nearest-methods' > ?'precede,GenomicRanges,GenomicRanges-method' > ?'follow,GenomicRanges,GenomicRanges-method' > > Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4. > > Thanks Harris. > > Valerie > > On 10/12/2012 04:17 PM, Valerie Obenchain wrote: > Hi Harris, > > On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: > Apologies in advance if I just don't understand the ignore.strand switch, or > perhaps GRanges objects. Also, I have not even tried to understand > > GenomicRanges:::.GenomicRanges_findPrecedeFollow > > I have the impression that a query with strand entirely "*" essentially implies > ignore.strand, but this case here is handled correctly only if ignore.strand is > TRUE (consistent with distanceToNearest but not distance, which seems correct): > > x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", strand="*") > Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), seqnames="chr1", strand=c("+","-")) > distance(x, Y[1]) > [1] 1 > distance(x, Y[2]) > [1] 2 > > nearest(x, Y, ignore.strand=TRUE) # correct > [1] 1 > nearest(ranges(x), ranges(Y)) # also correct > [1] 1 > > However, > > nearest(x, Y) > [1] 2 > Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and 1.10.2. > > The problem was in .nearest() where I computed the preceding and following distance of the ranges. I was missing abs() so the distance for the second range was -2 instead of 2. When checking to see which range was closest > > 1< -2 > [1] FALSE > > so the second range won. > distanceToNearest(x, Y) > DataFrame with 1 row and 3 columns > queryHits subjectHits distance > <integer> <integer> <integer> > 1 1 2 2 > > distanceToNearest(x, Y, ignore.strand=TRUE) > DataFrame with 1 row and 3 columns > queryHits subjectHits distance > <integer> <integer> <integer> > 1 1 1 1 > > > Finally, > > follow(ranges(x), ranges(Y)) > [1] NA > > The issue (along with GenomicRanges:::.nearest) must come down to this: > > follow(x, Y) # how can this be? > [1] 2 > This behavior is due to the fact that a '*' range can compare itself to either '+' or '-'. The 'x' is '*" located at position 5. When comparing it to the '+' Y[1] > we think of both ranges as '+'. precede() and follow() locations are determined by moving from 3' to 5'. On '+' strand this is from left to right so 5 precedes 6. > > precede(x, Y[1]) > [1] 1 > > but does not follow it > > follow(x, Y[1]) > [1] NA > > When comparing the '*' to Y[2] we think of both ranges as '-'. On the '-' moving from 3' to 5' is right to left so the 5 follows 7 > > follow(x, Y[2]) > [1] 1 > > but does not precede it > > precede(x, Y[2]) > [1] NA > > When we set ignore.strand=TRUE, all ranges are considered '+'. > > Valerie > whereas > > follow(x, Y, ignore.strand=TRUE) # correct, I think > [1] NA > > sessionInfo() > R Under development (unstable) (2012-10-10 r60908) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C > [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 > [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices datasets utils methods base > > other attached packages: > [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 > > loaded via a namespace (and not attached): > [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >
ADD REPLY
0
Entering edit mode
Hi Harris, Val, Michael, Right now, all those distances are equal to 8, which I think is what we want: distance(IRanges(1,2), IRanges(10,16)) distance(IRanges(10,16), IRanges(1,2)) distance(IRanges(1,2), IRanges(10,9)) distance(IRanges(10,9),IRanges(1,2)) distance(IRanges(3,2), IRanges(10,16)) distance(IRanges(10,16), IRanges(3,2)) distance(IRanges(3,2), IRanges(10,9)) distance(IRanges(10,9), IRanges(3,2)) Yes, from a mathematical point of view, the distance between a non-empty set and an empty set if not defined, but I'm not sure there would be much to gain in doing this. Some use cases would be needed. What's nice about the current behavior is that zero-width ranges are not treated any differently than other ranges. H. On 10/16/2012 12:21 PM, Harris A. Jaffee wrote: > Any rule is fine with me. They would not occur for us except by mistake. > I was basically raising an alert that the current situation is not quite > complete. > > On Oct 16, 2012, at 2:53 PM, Michael Lawrence wrote: > >> Since these zero-width ranges do have a position, I'm not sure why we cannot calculate their distance. As long as we have an established rule about how we handle them. For example, is it just from the start? That's probably most intuitive. >> >> Michael >> >> On Tue, Oct 16, 2012 at 11:25 AM, Valerie Obenchain <vobencha at="" fhcrc.org=""> wrote: >> On 10/16/2012 09:25 AM, Harris A. Jaffee wrote: >> Fantastic, thank you. I'm not sure I knew about that page! >> >> For example, ?precede gives me the IRanges nearest-methods page, >> and only if I explicitly request ?'nearest-methods' can I choose >> the GenomicRanges page. That brings up another issue: ?distance >> again sends me to the IRanges nearest-methods help, and that only >> documents distanceToNearest, which is not enough. The spec for >> distance actually lives only on >> ?'precede,GenomicRanges,GenomicRanges-method', as far as I see. >> >> I can update the IRanges man page to include distance(). >> >> Finally, and admittedly a total edge case, but zero-width ranges >> appear to be mishandled, or not handled, by the IRanges distance >> code. I think they should be infinitely far from anything, or at >> least NA, and there should not be any nearest or distanceToNearest >> value for a zero-width query. >> cc'ing Herve and Michael for history. >> >> In general I agree that zero-width ranges should not have a distance from any other range. I favor returning NA instead of inf since the range (albeit zero-width) does have a location. >> >> One concern I have is that we use zero-width ranges to represent insertions. In that context we may want to compute the distance to the nearest insertion. I do not have a concrete use case - just thinking of what might come. I'd be interested in opinions on this. >> >> Valerie >> >> >> >> x >> IRanges of length 1 >> start end width >> [1] 2 1 0 >> >> distance(x, x) >> [1] 1 >> >> y >> IRanges of length 1 >> start end width >> [1] 3 2 0 >> >> distance(x, y) >> [1] 2 >> >> z >> IRanges of length 1 >> start end width >> [1] 3 3 1 >> >> distance(x, z) >> [1] 2 >> >> On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote: >> >> A few follow up items from an off-list conversation with Harris. >> >> 1) I reversed the use of 5' and3' in my response below. Sorry about that. >> >> 2) The primary landing page for ?precede is in IRanges. I've expanded the \seealso section of this page to point the user to the man page in GenomicRanges. >> >> 3) The precede/follow man page in GenomicRanges now specifically states that 5' to 3' is the relevant orientation for precede/follow. For those interested in taking a look, the man page has several aliases, >> >> ?'nearest-methods' >> ?'precede,GenomicRanges,GenomicRanges-method' >> ?'follow,GenomicRanges,GenomicRanges-method' >> >> Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4. >> >> Thanks Harris. >> >> Valerie >> >> On 10/12/2012 04:17 PM, Valerie Obenchain wrote: >> Hi Harris, >> >> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: >> Apologies in advance if I just don't understand the ignore.strand switch, or >> perhaps GRanges objects. Also, I have not even tried to understand >> >> GenomicRanges:::.GenomicRanges_findPrecedeFollow >> >> I have the impression that a query with strand entirely "*" essentially implies >> ignore.strand, but this case here is handled correctly only if ignore.strand is >> TRUE (consistent with distanceToNearest but not distance, which seems correct): >> >> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", strand="*") >> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), seqnames="chr1", strand=c("+","-")) >> distance(x, Y[1]) >> [1] 1 >> distance(x, Y[2]) >> [1] 2 >> >> nearest(x, Y, ignore.strand=TRUE) # correct >> [1] 1 >> nearest(ranges(x), ranges(Y)) # also correct >> [1] 1 >> >> However, >> >> nearest(x, Y) >> [1] 2 >> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and 1.10.2. >> >> The problem was in .nearest() where I computed the preceding and following distance of the ranges. I was missing abs() so the distance for the second range was -2 instead of 2. When checking to see which range was closest >> >> 1< -2 >> [1] FALSE >> >> so the second range won. >> distanceToNearest(x, Y) >> DataFrame with 1 row and 3 columns >> queryHits subjectHits distance >> <integer> <integer> <integer> >> 1 1 2 2 >> >> distanceToNearest(x, Y, ignore.strand=TRUE) >> DataFrame with 1 row and 3 columns >> queryHits subjectHits distance >> <integer> <integer> <integer> >> 1 1 1 1 >> >> >> Finally, >> >> follow(ranges(x), ranges(Y)) >> [1] NA >> >> The issue (along with GenomicRanges:::.nearest) must come down to this: >> >> follow(x, Y) # how can this be? >> [1] 2 >> This behavior is due to the fact that a '*' range can compare itself to either '+' or '-'. The 'x' is '*" located at position 5. When comparing it to the '+' Y[1] >> we think of both ranges as '+'. precede() and follow() locations are determined by moving from 3' to 5'. On '+' strand this is from left to right so 5 precedes 6. >> >> precede(x, Y[1]) >> [1] 1 >> >> but does not follow it >> >> follow(x, Y[1]) >> [1] NA >> >> When comparing the '*' to Y[2] we think of both ranges as '-'. On the '-' moving from 3' to 5' is right to left so the 5 follows 7 >> >> follow(x, Y[2]) >> [1] 1 >> >> but does not precede it >> >> precede(x, Y[2]) >> [1] NA >> >> When we set ignore.strand=TRUE, all ranges are considered '+'. >> >> Valerie >> whereas >> >> follow(x, Y, ignore.strand=TRUE) # correct, I think >> [1] NA >> >> sessionInfo() >> R Under development (unstable) (2012-10-10 r60908) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C >> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 >> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices datasets utils methods base >> >> other attached packages: >> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 >> >> loaded via a namespace (and not attached): >> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
As I said, any rule is fine, especially if they can serve a purpose via their "location". But you also have this, which seems weird: > distance(IRanges(2,1), IRanges(2,1)) [1] 1 To say they are not treated any differently is a big stretch to me, since their defining formula (end = start-1) is already a violation of my senses. [I wonder if SEW = (start, NA, 0) wouldn't have been more useful.] Just so we don't continue off into useless theology, I should say that I started looking into the distance code with the idea of adding a "signed" option (say, distance(x,y)<0 iff x is downstream from y), and I wanted to understand the landscape first. On Oct 17, 2012, at 12:56 AM, Hervé Pagès wrote: > Hi Harris, Val, Michael, > > Right now, all those distances are equal to 8, which I think is what > we want: > > distance(IRanges(1,2), IRanges(10,16)) > distance(IRanges(10,16), IRanges(1,2)) > distance(IRanges(1,2), IRanges(10,9)) > distance(IRanges(10,9),IRanges(1,2)) > distance(IRanges(3,2), IRanges(10,16)) > distance(IRanges(10,16), IRanges(3,2)) > distance(IRanges(3,2), IRanges(10,9)) > distance(IRanges(10,9), IRanges(3,2)) > > Yes, from a mathematical point of view, the distance between a > non-empty set and an empty set if not defined, but I'm not sure there > would be much to gain in doing this. Some use cases would be needed. > What's nice about the current behavior is that zero-width ranges are > not treated any differently than other ranges. > > H. > > > On 10/16/2012 12:21 PM, Harris A. Jaffee wrote: >> Any rule is fine with me. They would not occur for us except by mistake. >> I was basically raising an alert that the current situation is not quite >> complete. >> >> On Oct 16, 2012, at 2:53 PM, Michael Lawrence wrote: >> >>> Since these zero-width ranges do have a position, I'm not sure why we cannot calculate their distance. As long as we have an established rule about how we handle them. For example, is it just from the start? That's probably most intuitive. >>> >>> Michael >>> >>> On Tue, Oct 16, 2012 at 11:25 AM, Valerie Obenchain <vobencha at="" fhcrc.org=""> wrote: >>> On 10/16/2012 09:25 AM, Harris A. Jaffee wrote: >>> Fantastic, thank you. I'm not sure I knew about that page! >>> >>> For example, ?precede gives me the IRanges nearest-methods page, >>> and only if I explicitly request ?'nearest-methods' can I choose >>> the GenomicRanges page. That brings up another issue: ?distance >>> again sends me to the IRanges nearest-methods help, and that only >>> documents distanceToNearest, which is not enough. The spec for >>> distance actually lives only on >>> ?'precede,GenomicRanges,GenomicRanges-method', as far as I see. >>> >>> I can update the IRanges man page to include distance(). >>> >>> Finally, and admittedly a total edge case, but zero-width ranges >>> appear to be mishandled, or not handled, by the IRanges distance >>> code. I think they should be infinitely far from anything, or at >>> least NA, and there should not be any nearest or distanceToNearest >>> value for a zero-width query. >>> cc'ing Herve and Michael for history. >>> >>> In general I agree that zero-width ranges should not have a distance from any other range. I favor returning NA instead of inf since the range (albeit zero-width) does have a location. >>> >>> One concern I have is that we use zero-width ranges to represent insertions. In that context we may want to compute the distance to the nearest insertion. I do not have a concrete use case - just thinking of what might come. I'd be interested in opinions on this. >>> >>> Valerie >>> >>> >>> >>> x >>> IRanges of length 1 >>> start end width >>> [1] 2 1 0 >>> >>> distance(x, x) >>> [1] 1 >>> >>> y >>> IRanges of length 1 >>> start end width >>> [1] 3 2 0 >>> >>> distance(x, y) >>> [1] 2 >>> >>> z >>> IRanges of length 1 >>> start end width >>> [1] 3 3 1 >>> >>> distance(x, z) >>> [1] 2 >>> >>> On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote: >>> >>> A few follow up items from an off-list conversation with Harris. >>> >>> 1) I reversed the use of 5' and3' in my response below. Sorry about that. >>> >>> 2) The primary landing page for ?precede is in IRanges. I've expanded the \seealso section of this page to point the user to the man page in GenomicRanges. >>> >>> 3) The precede/follow man page in GenomicRanges now specifically states that 5' to 3' is the relevant orientation for precede/follow. For those interested in taking a look, the man page has several aliases, >>> >>> ?'nearest-methods' >>> ?'precede,GenomicRanges,GenomicRanges-method' >>> ?'follow,GenomicRanges,GenomicRanges-method' >>> >>> Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4. >>> >>> Thanks Harris. >>> >>> Valerie >>> >>> On 10/12/2012 04:17 PM, Valerie Obenchain wrote: >>> Hi Harris, >>> >>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: >>> Apologies in advance if I just don't understand the ignore.strand switch, or >>> perhaps GRanges objects. Also, I have not even tried to understand >>> >>> GenomicRanges:::.GenomicRanges_findPrecedeFollow >>> >>> I have the impression that a query with strand entirely "*" essentially implies >>> ignore.strand, but this case here is handled correctly only if ignore.strand is >>> TRUE (consistent with distanceToNearest but not distance, which seems correct): >>> >>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", strand="*") >>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), seqnames="chr1", strand=c("+","-")) >>> distance(x, Y[1]) >>> [1] 1 >>> distance(x, Y[2]) >>> [1] 2 >>> >>> nearest(x, Y, ignore.strand=TRUE) # correct >>> [1] 1 >>> nearest(ranges(x), ranges(Y)) # also correct >>> [1] 1 >>> >>> However, >>> >>> nearest(x, Y) >>> [1] 2 >>> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and 1.10.2. >>> >>> The problem was in .nearest() where I computed the preceding and following distance of the ranges. I was missing abs() so the distance for the second range was -2 instead of 2. When checking to see which range was closest >>> >>> 1< -2 >>> [1] FALSE >>> >>> so the second range won. >>> distanceToNearest(x, Y) >>> DataFrame with 1 row and 3 columns >>> queryHits subjectHits distance >>> <integer> <integer> <integer> >>> 1 1 2 2 >>> >>> distanceToNearest(x, Y, ignore.strand=TRUE) >>> DataFrame with 1 row and 3 columns >>> queryHits subjectHits distance >>> <integer> <integer> <integer> >>> 1 1 1 1 >>> >>> >>> Finally, >>> >>> follow(ranges(x), ranges(Y)) >>> [1] NA >>> >>> The issue (along with GenomicRanges:::.nearest) must come down to this: >>> >>> follow(x, Y) # how can this be? >>> [1] 2 >>> This behavior is due to the fact that a '*' range can compare itself to either '+' or '-'. The 'x' is '*" located at position 5. When comparing it to the '+' Y[1] >>> we think of both ranges as '+'. precede() and follow() locations are determined by moving from 3' to 5'. On '+' strand this is from left to right so 5 precedes 6. >>> >>> precede(x, Y[1]) >>> [1] 1 >>> >>> but does not follow it >>> >>> follow(x, Y[1]) >>> [1] NA >>> >>> When comparing the '*' to Y[2] we think of both ranges as '-'. On the '-' moving from 3' to 5' is right to left so the 5 follows 7 >>> >>> follow(x, Y[2]) >>> [1] 1 >>> >>> but does not precede it >>> >>> precede(x, Y[2]) >>> [1] NA >>> >>> When we set ignore.strand=TRUE, all ranges are considered '+'. >>> >>> Valerie >>> whereas >>> >>> follow(x, Y, ignore.strand=TRUE) # correct, I think >>> [1] NA >>> >>> sessionInfo() >>> R Under development (unstable) (2012-10-10 r60908) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C >>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 >>> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices datasets utils methods base >>> >>> other attached packages: >>> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 >>> >>> loaded via a namespace (and not attached): >>> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
Hmm, that's odd. And it seems inconsistent with the usual distance() and distanceToNearest() methods, which would return 0 in this case, yes? e.g. if the IRanges were GRanges and sitting on top of each other on the same strand, or if the first was distance()'ed against itself (the same thing) That doesn't make sense to me either. This doesn't seem like an issue of theology to me, but rather inconsistency with what the docs say will happen. On Wed, Oct 17, 2012 at 7:31 AM, Harris A. Jaffee <hj@jhu.edu> wrote: > As I said, any rule is fine, especially if they can serve a purpose > via their "location". But you also have this, which seems weird: > > > distance(IRanges(2,1), IRanges(2,1)) > [1] 1 > > To say they are not treated any differently is a big stretch to me, > since their defining formula (end = start-1) is already a violation > of my senses. [I wonder if SEW = (start, NA, 0) wouldn't have been > more useful.] > > Just so we don't continue off into useless theology, I should say that > I started looking into the distance code with the idea of adding a > "signed" option (say, distance(x,y)<0 iff x is downstream from y), and > I wanted to understand the landscape first. > > On Oct 17, 2012, at 12:56 AM, Hervé Pagès wrote: > > > Hi Harris, Val, Michael, > > > > Right now, all those distances are equal to 8, which I think is what > > we want: > > > > distance(IRanges(1,2), IRanges(10,16)) > > distance(IRanges(10,16), IRanges(1,2)) > > distance(IRanges(1,2), IRanges(10,9)) > > distance(IRanges(10,9),IRanges(1,2)) > > distance(IRanges(3,2), IRanges(10,16)) > > distance(IRanges(10,16), IRanges(3,2)) > > distance(IRanges(3,2), IRanges(10,9)) > > distance(IRanges(10,9), IRanges(3,2)) > > > > Yes, from a mathematical point of view, the distance between a > > non-empty set and an empty set if not defined, but I'm not sure there > > would be much to gain in doing this. Some use cases would be needed. > > What's nice about the current behavior is that zero-width ranges are > > not treated any differently than other ranges. > > > > H. > > > > > > On 10/16/2012 12:21 PM, Harris A. Jaffee wrote: > >> Any rule is fine with me. They would not occur for us except by > mistake. > >> I was basically raising an alert that the current situation is not quite > >> complete. > >> > >> On Oct 16, 2012, at 2:53 PM, Michael Lawrence wrote: > >> > >>> Since these zero-width ranges do have a position, I'm not sure why we > cannot calculate their distance. As long as we have an established rule > about how we handle them. For example, is it just from the start? That's > probably most intuitive. > >>> > >>> Michael > >>> > >>> On Tue, Oct 16, 2012 at 11:25 AM, Valerie Obenchain < > vobencha@fhcrc.org> wrote: > >>> On 10/16/2012 09:25 AM, Harris A. Jaffee wrote: > >>> Fantastic, thank you. I'm not sure I knew about that page! > >>> > >>> For example, ?precede gives me the IRanges nearest-methods page, > >>> and only if I explicitly request ?'nearest-methods' can I choose > >>> the GenomicRanges page. That brings up another issue: ?distance > >>> again sends me to the IRanges nearest-methods help, and that only > >>> documents distanceToNearest, which is not enough. The spec for > >>> distance actually lives only on > >>> ?'precede,GenomicRanges,GenomicRanges-method', as far as I see. > >>> > >>> I can update the IRanges man page to include distance(). > >>> > >>> Finally, and admittedly a total edge case, but zero-width ranges > >>> appear to be mishandled, or not handled, by the IRanges distance > >>> code. I think they should be infinitely far from anything, or at > >>> least NA, and there should not be any nearest or distanceToNearest > >>> value for a zero-width query. > >>> cc'ing Herve and Michael for history. > >>> > >>> In general I agree that zero-width ranges should not have a distance > from any other range. I favor returning NA instead of inf since the range > (albeit zero-width) does have a location. > >>> > >>> One concern I have is that we use zero-width ranges to represent > insertions. In that context we may want to compute the distance to the > nearest insertion. I do not have a concrete use case - just thinking of > what might come. I'd be interested in opinions on this. > >>> > >>> Valerie > >>> > >>> > >>> > >>> x > >>> IRanges of length 1 > >>> start end width > >>> [1] 2 1 0 > >>> > >>> distance(x, x) > >>> [1] 1 > >>> > >>> y > >>> IRanges of length 1 > >>> start end width > >>> [1] 3 2 0 > >>> > >>> distance(x, y) > >>> [1] 2 > >>> > >>> z > >>> IRanges of length 1 > >>> start end width > >>> [1] 3 3 1 > >>> > >>> distance(x, z) > >>> [1] 2 > >>> > >>> On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote: > >>> > >>> A few follow up items from an off-list conversation with Harris. > >>> > >>> 1) I reversed the use of 5' and3' in my response below. Sorry about > that. > >>> > >>> 2) The primary landing page for ?precede is in IRanges. I've expanded > the \seealso section of this page to point the user to the man page in > GenomicRanges. > >>> > >>> 3) The precede/follow man page in GenomicRanges now specifically > states that 5' to 3' is the relevant orientation for precede/follow. For > those interested in taking a look, the man page has several aliases, > >>> > >>> ?'nearest-methods' > >>> ?'precede,GenomicRanges,GenomicRanges-method' > >>> ?'follow,GenomicRanges,GenomicRanges-method' > >>> > >>> Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4. > >>> > >>> Thanks Harris. > >>> > >>> Valerie > >>> > >>> On 10/12/2012 04:17 PM, Valerie Obenchain wrote: > >>> Hi Harris, > >>> > >>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: > >>> Apologies in advance if I just don't understand the ignore.strand > switch, or > >>> perhaps GRanges objects. Also, I have not even tried to understand > >>> > >>> GenomicRanges:::.GenomicRanges_findPrecedeFollow > >>> > >>> I have the impression that a query with strand entirely "*" > essentially implies > >>> ignore.strand, but this case here is handled correctly only if > ignore.strand is > >>> TRUE (consistent with distanceToNearest but not distance, which seems > correct): > >>> > >>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", > strand="*") > >>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), seqnames="chr1", > strand=c("+","-")) > >>> distance(x, Y[1]) > >>> [1] 1 > >>> distance(x, Y[2]) > >>> [1] 2 > >>> > >>> nearest(x, Y, ignore.strand=TRUE) # correct > >>> [1] 1 > >>> nearest(ranges(x), ranges(Y)) # also correct > >>> [1] 1 > >>> > >>> However, > >>> > >>> nearest(x, Y) > >>> [1] 2 > >>> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and > 1.10.2. > >>> > >>> The problem was in .nearest() where I computed the preceding and > following distance of the ranges. I was missing abs() so the distance for > the second range was -2 instead of 2. When checking to see which range was > closest > >>> > >>> 1< -2 > >>> [1] FALSE > >>> > >>> so the second range won. > >>> distanceToNearest(x, Y) > >>> DataFrame with 1 row and 3 columns > >>> queryHits subjectHits distance > >>> <integer> <integer> <integer> > >>> 1 1 2 2 > >>> > >>> distanceToNearest(x, Y, ignore.strand=TRUE) > >>> DataFrame with 1 row and 3 columns > >>> queryHits subjectHits distance > >>> <integer> <integer> <integer> > >>> 1 1 1 1 > >>> > >>> > >>> Finally, > >>> > >>> follow(ranges(x), ranges(Y)) > >>> [1] NA > >>> > >>> The issue (along with GenomicRanges:::.nearest) must come down to this: > >>> > >>> follow(x, Y) # how can this be? > >>> [1] 2 > >>> This behavior is due to the fact that a '*' range can compare itself > to either '+' or '-'. The 'x' is '*" located at position 5. When > comparing it to the '+' Y[1] > >>> we think of both ranges as '+'. precede() and follow() locations are > determined by moving from 3' to 5'. On '+' strand this is from left to > right so 5 precedes 6. > >>> > >>> precede(x, Y[1]) > >>> [1] 1 > >>> > >>> but does not follow it > >>> > >>> follow(x, Y[1]) > >>> [1] NA > >>> > >>> When comparing the '*' to Y[2] we think of both ranges as '-'. On the > '-' moving from 3' to 5' is right to left so the 5 follows 7 > >>> > >>> follow(x, Y[2]) > >>> [1] 1 > >>> > >>> but does not precede it > >>> > >>> precede(x, Y[2]) > >>> [1] NA > >>> > >>> When we set ignore.strand=TRUE, all ranges are considered '+'. > >>> > >>> Valerie > >>> whereas > >>> > >>> follow(x, Y, ignore.strand=TRUE) # correct, I think > >>> [1] NA > >>> > >>> sessionInfo() > >>> R Under development (unstable) (2012-10-10 r60908) > >>> Platform: x86_64-unknown-linux-gnu (64-bit) > >>> > >>> locale: > >>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C > >>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 > >>> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 > >>> [7] LC_PAPER=C LC_NAME=C > >>> [9] LC_ADDRESS=C LC_TELEPHONE=C > >>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C > >>> > >>> attached base packages: > >>> [1] stats graphics grDevices datasets utils methods base > >>> > >>> other attached packages: > >>> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 > >>> > >>> loaded via a namespace (and not attached): > >>> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor@r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor@r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > >>> > >>> > >> > > > > -- > > Hervé Pagès > > > > Program in Computational Biology > > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > > 1100 Fairview Ave. N, M1-B514 > > P.O. Box 19024 > > Seattle, WA 98109-1024 > > > > E-mail: hpages@fhcrc.org > > Phone: (206) 667-5791 > > Fax: (206) 667-1319 > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Indeed, for distance to behave as a metric it must be that distance(x,x) == 0 for all x Here is some further (consistent) oddness: > compare(GRanges('x',IRanges(4,3)),GRanges('x',IRanges(4,3))) [1] -5 I would expect the value to be 0 here. Comparing x with itself should yield 0. After all they are identical! --Malcolm On 10/17/12 9:50 AM, "Tim Triche, Jr." <tim.triche at="" gmail.com=""> wrote: >Hmm, that's odd. And it seems inconsistent with the usual distance() and >distanceToNearest() methods, which would return 0 in this case, yes? > >e.g. if the IRanges were GRanges and sitting on top of each other on the >same strand, or if the first was distance()'ed against itself (the same >thing) > >That doesn't make sense to me either. This doesn't seem like an issue of >theology to me, but rather inconsistency with what the docs say will >happen. > > > >On Wed, Oct 17, 2012 at 7:31 AM, Harris A. Jaffee <hj at="" jhu.edu=""> wrote: > >> As I said, any rule is fine, especially if they can serve a purpose >> via their "location". But you also have this, which seems weird: >> >> > distance(IRanges(2,1), IRanges(2,1)) >> [1] 1 >> >> To say they are not treated any differently is a big stretch to me, >> since their defining formula (end = start-1) is already a violation >> of my senses. [I wonder if SEW = (start, NA, 0) wouldn't have been >> more useful.] >> >> Just so we don't continue off into useless theology, I should say that >> I started looking into the distance code with the idea of adding a >> "signed" option (say, distance(x,y)<0 iff x is downstream from y), and >> I wanted to understand the landscape first. >> >> On Oct 17, 2012, at 12:56 AM, Hervé Pagès wrote: >> >> > Hi Harris, Val, Michael, >> > >> > Right now, all those distances are equal to 8, which I think is what >> > we want: >> > >> > distance(IRanges(1,2), IRanges(10,16)) >> > distance(IRanges(10,16), IRanges(1,2)) >> > distance(IRanges(1,2), IRanges(10,9)) >> > distance(IRanges(10,9),IRanges(1,2)) >> > distance(IRanges(3,2), IRanges(10,16)) >> > distance(IRanges(10,16), IRanges(3,2)) >> > distance(IRanges(3,2), IRanges(10,9)) >> > distance(IRanges(10,9), IRanges(3,2)) >> > >> > Yes, from a mathematical point of view, the distance between a >> > non-empty set and an empty set if not defined, but I'm not sure there >> > would be much to gain in doing this. Some use cases would be needed. >> > What's nice about the current behavior is that zero-width ranges are >> > not treated any differently than other ranges. >> > >> > H. >> > >> > >> > On 10/16/2012 12:21 PM, Harris A. Jaffee wrote: >> >> Any rule is fine with me. They would not occur for us except by >> mistake. >> >> I was basically raising an alert that the current situation is not >>quite >> >> complete. >> >> >> >> On Oct 16, 2012, at 2:53 PM, Michael Lawrence wrote: >> >> >> >>> Since these zero-width ranges do have a position, I'm not sure why >>we >> cannot calculate their distance. As long as we have an established rule >> about how we handle them. For example, is it just from the start? That's >> probably most intuitive. >> >>> >> >>> Michael >> >>> >> >>> On Tue, Oct 16, 2012 at 11:25 AM, Valerie Obenchain < >> vobencha at fhcrc.org> wrote: >> >>> On 10/16/2012 09:25 AM, Harris A. Jaffee wrote: >> >>> Fantastic, thank you. I'm not sure I knew about that page! >> >>> >> >>> For example, ?precede gives me the IRanges nearest-methods page, >> >>> and only if I explicitly request ?'nearest-methods' can I choose >> >>> the GenomicRanges page. That brings up another issue: ?distance >> >>> again sends me to the IRanges nearest-methods help, and that only >> >>> documents distanceToNearest, which is not enough. The spec for >> >>> distance actually lives only on >> >>> ?'precede,GenomicRanges,GenomicRanges-method', as far as I see. >> >>> >> >>> I can update the IRanges man page to include distance(). >> >>> >> >>> Finally, and admittedly a total edge case, but zero-width ranges >> >>> appear to be mishandled, or not handled, by the IRanges distance >> >>> code. I think they should be infinitely far from anything, or at >> >>> least NA, and there should not be any nearest or distanceToNearest >> >>> value for a zero-width query. >> >>> cc'ing Herve and Michael for history. >> >>> >> >>> In general I agree that zero-width ranges should not have a distance >> from any other range. I favor returning NA instead of inf since the >>range >> (albeit zero-width) does have a location. >> >>> >> >>> One concern I have is that we use zero-width ranges to represent >> insertions. In that context we may want to compute the distance to the >> nearest insertion. I do not have a concrete use case - just thinking of >> what might come. I'd be interested in opinions on this. >> >>> >> >>> Valerie >> >>> >> >>> >> >>> >> >>> x >> >>> IRanges of length 1 >> >>> start end width >> >>> [1] 2 1 0 >> >>> >> >>> distance(x, x) >> >>> [1] 1 >> >>> >> >>> y >> >>> IRanges of length 1 >> >>> start end width >> >>> [1] 3 2 0 >> >>> >> >>> distance(x, y) >> >>> [1] 2 >> >>> >> >>> z >> >>> IRanges of length 1 >> >>> start end width >> >>> [1] 3 3 1 >> >>> >> >>> distance(x, z) >> >>> [1] 2 >> >>> >> >>> On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote: >> >>> >> >>> A few follow up items from an off-list conversation with Harris. >> >>> >> >>> 1) I reversed the use of 5' and3' in my response below. Sorry about >> that. >> >>> >> >>> 2) The primary landing page for ?precede is in IRanges. I've >>expanded >> the \seealso section of this page to point the user to the man page in >> GenomicRanges. >> >>> >> >>> 3) The precede/follow man page in GenomicRanges now specifically >> states that 5' to 3' is the relevant orientation for precede/follow. For >> those interested in taking a look, the man page has several aliases, >> >>> >> >>> ?'nearest-methods' >> >>> ?'precede,GenomicRanges,GenomicRanges-method' >> >>> ?'follow,GenomicRanges,GenomicRanges-method' >> >>> >> >>> Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4. >> >>> >> >>> Thanks Harris. >> >>> >> >>> Valerie >> >>> >> >>> On 10/12/2012 04:17 PM, Valerie Obenchain wrote: >> >>> Hi Harris, >> >>> >> >>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: >> >>> Apologies in advance if I just don't understand the ignore.strand >> switch, or >> >>> perhaps GRanges objects. Also, I have not even tried to understand >> >>> >> >>> GenomicRanges:::.GenomicRanges_findPrecedeFollow >> >>> >> >>> I have the impression that a query with strand entirely "*" >> essentially implies >> >>> ignore.strand, but this case here is handled correctly only if >> ignore.strand is >> >>> TRUE (consistent with distanceToNearest but not distance, which >>seems >> correct): >> >>> >> >>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", >> strand="*") >> >>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), >>seqnames="chr1", >> strand=c("+","-")) >> >>> distance(x, Y[1]) >> >>> [1] 1 >> >>> distance(x, Y[2]) >> >>> [1] 2 >> >>> >> >>> nearest(x, Y, ignore.strand=TRUE) # correct >> >>> [1] 1 >> >>> nearest(ranges(x), ranges(Y)) # also correct >> >>> [1] 1 >> >>> >> >>> However, >> >>> >> >>> nearest(x, Y) >> >>> [1] 2 >> >>> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and >> 1.10.2. >> >>> >> >>> The problem was in .nearest() where I computed the preceding and >> following distance of the ranges. I was missing abs() so the distance >>for >> the second range was -2 instead of 2. When checking to see which range >>was >> closest >> >>> >> >>> 1< -2 >> >>> [1] FALSE >> >>> >> >>> so the second range won. >> >>> distanceToNearest(x, Y) >> >>> DataFrame with 1 row and 3 columns >> >>> queryHits subjectHits distance >> >>> <integer> <integer> <integer> >> >>> 1 1 2 2 >> >>> >> >>> distanceToNearest(x, Y, ignore.strand=TRUE) >> >>> DataFrame with 1 row and 3 columns >> >>> queryHits subjectHits distance >> >>> <integer> <integer> <integer> >> >>> 1 1 1 1 >> >>> >> >>> >> >>> Finally, >> >>> >> >>> follow(ranges(x), ranges(Y)) >> >>> [1] NA >> >>> >> >>> The issue (along with GenomicRanges:::.nearest) must come down to >>this: >> >>> >> >>> follow(x, Y) # how can this be? >> >>> [1] 2 >> >>> This behavior is due to the fact that a '*' range can compare itself >> to either '+' or '-'. The 'x' is '*" located at position 5. When >> comparing it to the '+' Y[1] >> >>> we think of both ranges as '+'. precede() and follow() locations are >> determined by moving from 3' to 5'. On '+' strand this is from left to >> right so 5 precedes 6. >> >>> >> >>> precede(x, Y[1]) >> >>> [1] 1 >> >>> >> >>> but does not follow it >> >>> >> >>> follow(x, Y[1]) >> >>> [1] NA >> >>> >> >>> When comparing the '*' to Y[2] we think of both ranges as '-'. On >>the >> '-' moving from 3' to 5' is right to left so the 5 follows 7 >> >>> >> >>> follow(x, Y[2]) >> >>> [1] 1 >> >>> >> >>> but does not precede it >> >>> >> >>> precede(x, Y[2]) >> >>> [1] NA >> >>> >> >>> When we set ignore.strand=TRUE, all ranges are considered '+'. >> >>> >> >>> Valerie >> >>> whereas >> >>> >> >>> follow(x, Y, ignore.strand=TRUE) # correct, I think >> >>> [1] NA >> >>> >> >>> sessionInfo() >> >>> R Under development (unstable) (2012-10-10 r60908) >> >>> Platform: x86_64-unknown-linux-gnu (64-bit) >> >>> >> >>> locale: >> >>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C >> >>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 >> >>> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 >> >>> [7] LC_PAPER=C LC_NAME=C >> >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >> >>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C >> >>> >> >>> attached base packages: >> >>> [1] stats graphics grDevices datasets utils methods base >> >>> >> >>> other attached packages: >> >>> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 >> >>> >> >>> loaded via a namespace (and not attached): >> >>> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 >> >>> >> >>> _______________________________________________ >> >>> Bioconductor mailing list >> >>> Bioconductor at r-project.org >> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >>> _______________________________________________ >> >>> Bioconductor mailing list >> >>> Bioconductor at r-project.org >> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >>> >> >>> >> >> >> > >> > -- >> > Hervé Pagès >> > >> > Program in Computational Biology >> > Division of Public Health Sciences >> > Fred Hutchinson Cancer Research Center >> > 1100 Fairview Ave. N, M1-B514 >> > P.O. Box 19024 >> > Seattle, WA 98109-1024 >> > >> > E-mail: hpages at fhcrc.org >> > Phone: (206) 667-5791 >> > Fax: (206) 667-1319 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > >-- >*A model is a lie that helps you see the truth.* >* >* >Howard >Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > > [[alternative HTML version deleted]] >
ADD REPLY
0
Entering edit mode
On Oct 17, 2012, at 11:31 AM, Cook, Malcolm wrote: > Indeed, for distance to behave as a metric it must be that distance(x,x) > == 0 for all x > > Here is some further (consistent) oddness: > >> compare(GRanges('x',IRanges(4,3)),GRanges('x',IRanges(4,3))) > [1] -5 Thanks for the tip on compare. Looks like a nice classification scheme. I should start reading vignettes! > I would expect the value to be 0 here. Comparing x with itself should > yield 0. After all they are identical! Yeah, identical but *empty*. Therefore, none of the scenarios depicted in the man page seems appropriate (even 0/g): -6 a: x[i]: .oooo....... 6 m: x[i]: .......oooo. y[i]: .......oooo. y[i]: .oooo....... -5 b: x[i]: ..oooo...... 5 l: x[i]: ......oooo.. y[i]: ......oooo.. y[i]: ..oooo...... -4 c: x[i]: ...oooo..... 4 k: x[i]: .....oooo... y[i]: .....oooo... y[i]: ...oooo..... -3 d: x[i]: ...oooooo... 3 j: x[i]: .....oooo... y[i]: .....oooo... y[i]: ...oooooo... -2 e: x[i]: ..oooooooo.. 2 i: x[i]: ....oooo.... y[i]: ....oooo.... y[i]: ..oooooooo.. -1 f: x[i]: ...oooo..... 1 h: x[i]: ...oooooo... y[i]: ...oooooo... y[i]: ...oooo..... 0 g: x[i]: ...oooooo... y[i]: ...oooooo... Again, I like NA in these cases. > --Malcolm > > > On 10/17/12 9:50 AM, "Tim Triche, Jr." <tim.triche at="" gmail.com=""> wrote: > >> Hmm, that's odd. And it seems inconsistent with the usual distance() and >> distanceToNearest() methods, which would return 0 in this case, yes? >> >> e.g. if the IRanges were GRanges and sitting on top of each other on the >> same strand, or if the first was distance()'ed against itself (the same >> thing) >> >> That doesn't make sense to me either. This doesn't seem like an issue of >> theology to me, but rather inconsistency with what the docs say will >> happen. >> >> >> >> On Wed, Oct 17, 2012 at 7:31 AM, Harris A. Jaffee <hj at="" jhu.edu=""> wrote: >> >>> As I said, any rule is fine, especially if they can serve a purpose >>> via their "location". But you also have this, which seems weird: >>> >>>> distance(IRanges(2,1), IRanges(2,1)) >>> [1] 1 >>> >>> To say they are not treated any differently is a big stretch to me, >>> since their defining formula (end = start-1) is already a violation >>> of my senses. [I wonder if SEW = (start, NA, 0) wouldn't have been >>> more useful.] >>> >>> Just so we don't continue off into useless theology, I should say that >>> I started looking into the distance code with the idea of adding a >>> "signed" option (say, distance(x,y)<0 iff x is downstream from y), and >>> I wanted to understand the landscape first. >>> >>> On Oct 17, 2012, at 12:56 AM, Hervé Pagès wrote: >>> >>>> Hi Harris, Val, Michael, >>>> >>>> Right now, all those distances are equal to 8, which I think is what >>>> we want: >>>> >>>> distance(IRanges(1,2), IRanges(10,16)) >>>> distance(IRanges(10,16), IRanges(1,2)) >>>> distance(IRanges(1,2), IRanges(10,9)) >>>> distance(IRanges(10,9),IRanges(1,2)) >>>> distance(IRanges(3,2), IRanges(10,16)) >>>> distance(IRanges(10,16), IRanges(3,2)) >>>> distance(IRanges(3,2), IRanges(10,9)) >>>> distance(IRanges(10,9), IRanges(3,2)) >>>> >>>> Yes, from a mathematical point of view, the distance between a >>>> non-empty set and an empty set if not defined, but I'm not sure there >>>> would be much to gain in doing this. Some use cases would be needed. >>>> What's nice about the current behavior is that zero-width ranges are >>>> not treated any differently than other ranges. >>>> >>>> H. >>>> >>>> >>>> On 10/16/2012 12:21 PM, Harris A. Jaffee wrote: >>>>> Any rule is fine with me. They would not occur for us except by >>> mistake. >>>>> I was basically raising an alert that the current situation is not >>> quite >>>>> complete. >>>>> >>>>> On Oct 16, 2012, at 2:53 PM, Michael Lawrence wrote: >>>>> >>>>>> Since these zero-width ranges do have a position, I'm not sure why >>> we >>> cannot calculate their distance. As long as we have an established rule >>> about how we handle them. For example, is it just from the start? That's >>> probably most intuitive. >>>>>> >>>>>> Michael >>>>>> >>>>>> On Tue, Oct 16, 2012 at 11:25 AM, Valerie Obenchain < >>> vobencha at fhcrc.org> wrote: >>>>>> On 10/16/2012 09:25 AM, Harris A. Jaffee wrote: >>>>>> Fantastic, thank you. I'm not sure I knew about that page! >>>>>> >>>>>> For example, ?precede gives me the IRanges nearest-methods page, >>>>>> and only if I explicitly request ?'nearest-methods' can I choose >>>>>> the GenomicRanges page. That brings up another issue: ?distance >>>>>> again sends me to the IRanges nearest-methods help, and that only >>>>>> documents distanceToNearest, which is not enough. The spec for >>>>>> distance actually lives only on >>>>>> ?'precede,GenomicRanges,GenomicRanges-method', as far as I see. >>>>>> >>>>>> I can update the IRanges man page to include distance(). >>>>>> >>>>>> Finally, and admittedly a total edge case, but zero-width ranges >>>>>> appear to be mishandled, or not handled, by the IRanges distance >>>>>> code. I think they should be infinitely far from anything, or at >>>>>> least NA, and there should not be any nearest or distanceToNearest >>>>>> value for a zero-width query. >>>>>> cc'ing Herve and Michael for history. >>>>>> >>>>>> In general I agree that zero-width ranges should not have a distance >>> from any other range. I favor returning NA instead of inf since the >>> range >>> (albeit zero-width) does have a location. >>>>>> >>>>>> One concern I have is that we use zero-width ranges to represent >>> insertions. In that context we may want to compute the distance to the >>> nearest insertion. I do not have a concrete use case - just thinking of >>> what might come. I'd be interested in opinions on this. >>>>>> >>>>>> Valerie >>>>>> >>>>>> >>>>>> >>>>>> x >>>>>> IRanges of length 1 >>>>>> start end width >>>>>> [1] 2 1 0 >>>>>> >>>>>> distance(x, x) >>>>>> [1] 1 >>>>>> >>>>>> y >>>>>> IRanges of length 1 >>>>>> start end width >>>>>> [1] 3 2 0 >>>>>> >>>>>> distance(x, y) >>>>>> [1] 2 >>>>>> >>>>>> z >>>>>> IRanges of length 1 >>>>>> start end width >>>>>> [1] 3 3 1 >>>>>> >>>>>> distance(x, z) >>>>>> [1] 2 >>>>>> >>>>>> On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote: >>>>>> >>>>>> A few follow up items from an off-list conversation with Harris. >>>>>> >>>>>> 1) I reversed the use of 5' and3' in my response below. Sorry about >>> that. >>>>>> >>>>>> 2) The primary landing page for ?precede is in IRanges. I've >>> expanded >>> the \seealso section of this page to point the user to the man page in >>> GenomicRanges. >>>>>> >>>>>> 3) The precede/follow man page in GenomicRanges now specifically >>> states that 5' to 3' is the relevant orientation for precede/follow. For >>> those interested in taking a look, the man page has several aliases, >>>>>> >>>>>> ?'nearest-methods' >>>>>> ?'precede,GenomicRanges,GenomicRanges-method' >>>>>> ?'follow,GenomicRanges,GenomicRanges-method' >>>>>> >>>>>> Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4. >>>>>> >>>>>> Thanks Harris. >>>>>> >>>>>> Valerie >>>>>> >>>>>> On 10/12/2012 04:17 PM, Valerie Obenchain wrote: >>>>>> Hi Harris, >>>>>> >>>>>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: >>>>>> Apologies in advance if I just don't understand the ignore.strand >>> switch, or >>>>>> perhaps GRanges objects. Also, I have not even tried to understand >>>>>> >>>>>> GenomicRanges:::.GenomicRanges_findPrecedeFollow >>>>>> >>>>>> I have the impression that a query with strand entirely "*" >>> essentially implies >>>>>> ignore.strand, but this case here is handled correctly only if >>> ignore.strand is >>>>>> TRUE (consistent with distanceToNearest but not distance, which >>> seems >>> correct): >>>>>> >>>>>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", >>> strand="*") >>>>>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), >>> seqnames="chr1", >>> strand=c("+","-")) >>>>>> distance(x, Y[1]) >>>>>> [1] 1 >>>>>> distance(x, Y[2]) >>>>>> [1] 2 >>>>>> >>>>>> nearest(x, Y, ignore.strand=TRUE) # correct >>>>>> [1] 1 >>>>>> nearest(ranges(x), ranges(Y)) # also correct >>>>>> [1] 1 >>>>>> >>>>>> However, >>>>>> >>>>>> nearest(x, Y) >>>>>> [1] 2 >>>>>> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and >>> 1.10.2. >>>>>> >>>>>> The problem was in .nearest() where I computed the preceding and >>> following distance of the ranges. I was missing abs() so the distance >>> for >>> the second range was -2 instead of 2. When checking to see which range >>> was >>> closest >>>>>> >>>>>> 1< -2 >>>>>> [1] FALSE >>>>>> >>>>>> so the second range won. >>>>>> distanceToNearest(x, Y) >>>>>> DataFrame with 1 row and 3 columns >>>>>> queryHits subjectHits distance >>>>>> <integer> <integer> <integer> >>>>>> 1 1 2 2 >>>>>> >>>>>> distanceToNearest(x, Y, ignore.strand=TRUE) >>>>>> DataFrame with 1 row and 3 columns >>>>>> queryHits subjectHits distance >>>>>> <integer> <integer> <integer> >>>>>> 1 1 1 1 >>>>>> >>>>>> >>>>>> Finally, >>>>>> >>>>>> follow(ranges(x), ranges(Y)) >>>>>> [1] NA >>>>>> >>>>>> The issue (along with GenomicRanges:::.nearest) must come down to >>> this: >>>>>> >>>>>> follow(x, Y) # how can this be? >>>>>> [1] 2 >>>>>> This behavior is due to the fact that a '*' range can compare itself >>> to either '+' or '-'. The 'x' is '*" located at position 5. When >>> comparing it to the '+' Y[1] >>>>>> we think of both ranges as '+'. precede() and follow() locations are >>> determined by moving from 3' to 5'. On '+' strand this is from left to >>> right so 5 precedes 6. >>>>>> >>>>>> precede(x, Y[1]) >>>>>> [1] 1 >>>>>> >>>>>> but does not follow it >>>>>> >>>>>> follow(x, Y[1]) >>>>>> [1] NA >>>>>> >>>>>> When comparing the '*' to Y[2] we think of both ranges as '-'. On >>> the >>> '-' moving from 3' to 5' is right to left so the 5 follows 7 >>>>>> >>>>>> follow(x, Y[2]) >>>>>> [1] 1 >>>>>> >>>>>> but does not precede it >>>>>> >>>>>> precede(x, Y[2]) >>>>>> [1] NA >>>>>> >>>>>> When we set ignore.strand=TRUE, all ranges are considered '+'. >>>>>> >>>>>> Valerie >>>>>> whereas >>>>>> >>>>>> follow(x, Y, ignore.strand=TRUE) # correct, I think >>>>>> [1] NA >>>>>> >>>>>> sessionInfo() >>>>>> R Under development (unstable) (2012-10-10 r60908) >>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C >>>>>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 >>>>>> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 >>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] stats graphics grDevices datasets utils methods base >>>>>> >>>>>> other attached packages: >>>>>> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>>> >>>>> >>>> >>>> -- >>>> Hervé Pagès >>>> >>>> Program in Computational Biology >>>> Division of Public Health Sciences >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N, M1-B514 >>>> P.O. Box 19024 >>>> Seattle, WA 98109-1024 >>>> >>>> E-mail: hpages at fhcrc.org >>>> Phone: (206) 667-5791 >>>> Fax: (206) 667-1319 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >> >> -- >> *A model is a lie that helps you see the truth.* >> * >> * >> Howard >> Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> >> [[alternative HTML version deleted]] >> >
ADD REPLY
0
Entering edit mode
Hi, On 10/17/2012 09:22 AM, Harris A. Jaffee wrote: > On Oct 17, 2012, at 11:31 AM, Cook, Malcolm wrote: > >> Indeed, for distance to behave as a metric it must be that distance(x,x) >> == 0 for all x >> >> Here is some further (consistent) oddness: >> >>> compare(GRanges('x',IRanges(4,3)),GRanges('x',IRanges(4,3))) >> [1] -5 > > Thanks for the tip on compare. Looks like a nice classification scheme. > I should start reading vignettes! > >> I would expect the value to be 0 here. Comparing x with itself should >> yield 0. After all they are identical! > > Yeah, identical but *empty*. Therefore, none of the scenarios depicted > in the man page seems appropriate (even 0/g): > > -6 a: x[i]: .oooo....... 6 m: x[i]: .......oooo. > y[i]: .......oooo. y[i]: .oooo....... > > -5 b: x[i]: ..oooo...... 5 l: x[i]: ......oooo.. > y[i]: ......oooo.. y[i]: ..oooo...... > > -4 c: x[i]: ...oooo..... 4 k: x[i]: .....oooo... > y[i]: .....oooo... y[i]: ...oooo..... > > -3 d: x[i]: ...oooooo... 3 j: x[i]: .....oooo... > y[i]: .....oooo... y[i]: ...oooooo... > > -2 e: x[i]: ..oooooooo.. 2 i: x[i]: ....oooo.... > y[i]: ....oooo.... y[i]: ..oooooooo.. > > -1 f: x[i]: ...oooo..... 1 h: x[i]: ...oooooo... > y[i]: ...oooooo... y[i]: ...oooo..... > > 0 g: x[i]: ...oooooo... > y[i]: ...oooooo... > > > Again, I like NA in these cases. compare() should be consistent with the comparison binary operators (==, !=, <=, <, >=, >), in the sense that we should have: x == y <=> compare(x, y) == 0 x != y <=> compare(x, y) != 0 x <= y <=> compare(x, y) <= 0 etc... Right now those operations are actually using compare() behind the scene when the operands are GRanges objects: > gr <- GRanges('x',IRanges(4,3)) > gr == gr [1] FALSE So compare(x, x) needs to be fixed to always return 0. Note that the comparison binary operators on Ranges objects are not based on compare() yet so even though compare(ir, ir) is not 0 here, we get a TRUE for ir == ir: > compare(ir, ir) [1] -5 > ir == ir [1] TRUE I have on my list to change the current implementation of the comparison binary operators on Ranges objects so they also use compare() behind the scene (that will make them about twice faster because compare() is implemented in C). I'll fix compare() right away. Thanks for the feedback! H. > > >> --Malcolm >> >> >> On 10/17/12 9:50 AM, "Tim Triche, Jr." <tim.triche at="" gmail.com=""> wrote: >> >>> Hmm, that's odd. And it seems inconsistent with the usual distance() and >>> distanceToNearest() methods, which would return 0 in this case, yes? >>> >>> e.g. if the IRanges were GRanges and sitting on top of each other on the >>> same strand, or if the first was distance()'ed against itself (the same >>> thing) >>> >>> That doesn't make sense to me either. This doesn't seem like an issue of >>> theology to me, but rather inconsistency with what the docs say will >>> happen. >>> >>> >>> >>> On Wed, Oct 17, 2012 at 7:31 AM, Harris A. Jaffee <hj at="" jhu.edu=""> wrote: >>> >>>> As I said, any rule is fine, especially if they can serve a purpose >>>> via their "location". But you also have this, which seems weird: >>>> >>>>> distance(IRanges(2,1), IRanges(2,1)) >>>> [1] 1 >>>> >>>> To say they are not treated any differently is a big stretch to me, >>>> since their defining formula (end = start-1) is already a violation >>>> of my senses. [I wonder if SEW = (start, NA, 0) wouldn't have been >>>> more useful.] >>>> >>>> Just so we don't continue off into useless theology, I should say that >>>> I started looking into the distance code with the idea of adding a >>>> "signed" option (say, distance(x,y)<0 iff x is downstream from y), and >>>> I wanted to understand the landscape first. >>>> >>>> On Oct 17, 2012, at 12:56 AM, Hervé Pagès wrote: >>>> >>>>> Hi Harris, Val, Michael, >>>>> >>>>> Right now, all those distances are equal to 8, which I think is what >>>>> we want: >>>>> >>>>> distance(IRanges(1,2), IRanges(10,16)) >>>>> distance(IRanges(10,16), IRanges(1,2)) >>>>> distance(IRanges(1,2), IRanges(10,9)) >>>>> distance(IRanges(10,9),IRanges(1,2)) >>>>> distance(IRanges(3,2), IRanges(10,16)) >>>>> distance(IRanges(10,16), IRanges(3,2)) >>>>> distance(IRanges(3,2), IRanges(10,9)) >>>>> distance(IRanges(10,9), IRanges(3,2)) >>>>> >>>>> Yes, from a mathematical point of view, the distance between a >>>>> non-empty set and an empty set if not defined, but I'm not sure there >>>>> would be much to gain in doing this. Some use cases would be needed. >>>>> What's nice about the current behavior is that zero-width ranges are >>>>> not treated any differently than other ranges. >>>>> >>>>> H. >>>>> >>>>> >>>>> On 10/16/2012 12:21 PM, Harris A. Jaffee wrote: >>>>>> Any rule is fine with me. They would not occur for us except by >>>> mistake. >>>>>> I was basically raising an alert that the current situation is not >>>> quite >>>>>> complete. >>>>>> >>>>>> On Oct 16, 2012, at 2:53 PM, Michael Lawrence wrote: >>>>>> >>>>>>> Since these zero-width ranges do have a position, I'm not sure why >>>> we >>>> cannot calculate their distance. As long as we have an established rule >>>> about how we handle them. For example, is it just from the start? That's >>>> probably most intuitive. >>>>>>> >>>>>>> Michael >>>>>>> >>>>>>> On Tue, Oct 16, 2012 at 11:25 AM, Valerie Obenchain < >>>> vobencha at fhcrc.org> wrote: >>>>>>> On 10/16/2012 09:25 AM, Harris A. Jaffee wrote: >>>>>>> Fantastic, thank you. I'm not sure I knew about that page! >>>>>>> >>>>>>> For example, ?precede gives me the IRanges nearest-methods page, >>>>>>> and only if I explicitly request ?'nearest-methods' can I choose >>>>>>> the GenomicRanges page. That brings up another issue: ?distance >>>>>>> again sends me to the IRanges nearest-methods help, and that only >>>>>>> documents distanceToNearest, which is not enough. The spec for >>>>>>> distance actually lives only on >>>>>>> ?'precede,GenomicRanges,GenomicRanges-method', as far as I see. >>>>>>> >>>>>>> I can update the IRanges man page to include distance(). >>>>>>> >>>>>>> Finally, and admittedly a total edge case, but zero-width ranges >>>>>>> appear to be mishandled, or not handled, by the IRanges distance >>>>>>> code. I think they should be infinitely far from anything, or at >>>>>>> least NA, and there should not be any nearest or distanceToNearest >>>>>>> value for a zero-width query. >>>>>>> cc'ing Herve and Michael for history. >>>>>>> >>>>>>> In general I agree that zero-width ranges should not have a distance >>>> from any other range. I favor returning NA instead of inf since the >>>> range >>>> (albeit zero-width) does have a location. >>>>>>> >>>>>>> One concern I have is that we use zero-width ranges to represent >>>> insertions. In that context we may want to compute the distance to the >>>> nearest insertion. I do not have a concrete use case - just thinking of >>>> what might come. I'd be interested in opinions on this. >>>>>>> >>>>>>> Valerie >>>>>>> >>>>>>> >>>>>>> >>>>>>> x >>>>>>> IRanges of length 1 >>>>>>> start end width >>>>>>> [1] 2 1 0 >>>>>>> >>>>>>> distance(x, x) >>>>>>> [1] 1 >>>>>>> >>>>>>> y >>>>>>> IRanges of length 1 >>>>>>> start end width >>>>>>> [1] 3 2 0 >>>>>>> >>>>>>> distance(x, y) >>>>>>> [1] 2 >>>>>>> >>>>>>> z >>>>>>> IRanges of length 1 >>>>>>> start end width >>>>>>> [1] 3 3 1 >>>>>>> >>>>>>> distance(x, z) >>>>>>> [1] 2 >>>>>>> >>>>>>> On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote: >>>>>>> >>>>>>> A few follow up items from an off-list conversation with Harris. >>>>>>> >>>>>>> 1) I reversed the use of 5' and3' in my response below. Sorry about >>>> that. >>>>>>> >>>>>>> 2) The primary landing page for ?precede is in IRanges. I've >>>> expanded >>>> the \seealso section of this page to point the user to the man page in >>>> GenomicRanges. >>>>>>> >>>>>>> 3) The precede/follow man page in GenomicRanges now specifically >>>> states that 5' to 3' is the relevant orientation for precede/follow. For >>>> those interested in taking a look, the man page has several aliases, >>>>>>> >>>>>>> ?'nearest-methods' >>>>>>> ?'precede,GenomicRanges,GenomicRanges-method' >>>>>>> ?'follow,GenomicRanges,GenomicRanges-method' >>>>>>> >>>>>>> Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4. >>>>>>> >>>>>>> Thanks Harris. >>>>>>> >>>>>>> Valerie >>>>>>> >>>>>>> On 10/12/2012 04:17 PM, Valerie Obenchain wrote: >>>>>>> Hi Harris, >>>>>>> >>>>>>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: >>>>>>> Apologies in advance if I just don't understand the ignore.strand >>>> switch, or >>>>>>> perhaps GRanges objects. Also, I have not even tried to understand >>>>>>> >>>>>>> GenomicRanges:::.GenomicRanges_findPrecedeFollow >>>>>>> >>>>>>> I have the impression that a query with strand entirely "*" >>>> essentially implies >>>>>>> ignore.strand, but this case here is handled correctly only if >>>> ignore.strand is >>>>>>> TRUE (consistent with distanceToNearest but not distance, which >>>> seems >>>> correct): >>>>>>> >>>>>>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", >>>> strand="*") >>>>>>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), >>>> seqnames="chr1", >>>> strand=c("+","-")) >>>>>>> distance(x, Y[1]) >>>>>>> [1] 1 >>>>>>> distance(x, Y[2]) >>>>>>> [1] 2 >>>>>>> >>>>>>> nearest(x, Y, ignore.strand=TRUE) # correct >>>>>>> [1] 1 >>>>>>> nearest(ranges(x), ranges(Y)) # also correct >>>>>>> [1] 1 >>>>>>> >>>>>>> However, >>>>>>> >>>>>>> nearest(x, Y) >>>>>>> [1] 2 >>>>>>> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and >>>> 1.10.2. >>>>>>> >>>>>>> The problem was in .nearest() where I computed the preceding and >>>> following distance of the ranges. I was missing abs() so the distance >>>> for >>>> the second range was -2 instead of 2. When checking to see which range >>>> was >>>> closest >>>>>>> >>>>>>> 1< -2 >>>>>>> [1] FALSE >>>>>>> >>>>>>> so the second range won. >>>>>>> distanceToNearest(x, Y) >>>>>>> DataFrame with 1 row and 3 columns >>>>>>> queryHits subjectHits distance >>>>>>> <integer> <integer> <integer> >>>>>>> 1 1 2 2 >>>>>>> >>>>>>> distanceToNearest(x, Y, ignore.strand=TRUE) >>>>>>> DataFrame with 1 row and 3 columns >>>>>>> queryHits subjectHits distance >>>>>>> <integer> <integer> <integer> >>>>>>> 1 1 1 1 >>>>>>> >>>>>>> >>>>>>> Finally, >>>>>>> >>>>>>> follow(ranges(x), ranges(Y)) >>>>>>> [1] NA >>>>>>> >>>>>>> The issue (along with GenomicRanges:::.nearest) must come down to >>>> this: >>>>>>> >>>>>>> follow(x, Y) # how can this be? >>>>>>> [1] 2 >>>>>>> This behavior is due to the fact that a '*' range can compare itself >>>> to either '+' or '-'. The 'x' is '*" located at position 5. When >>>> comparing it to the '+' Y[1] >>>>>>> we think of both ranges as '+'. precede() and follow() locations are >>>> determined by moving from 3' to 5'. On '+' strand this is from left to >>>> right so 5 precedes 6. >>>>>>> >>>>>>> precede(x, Y[1]) >>>>>>> [1] 1 >>>>>>> >>>>>>> but does not follow it >>>>>>> >>>>>>> follow(x, Y[1]) >>>>>>> [1] NA >>>>>>> >>>>>>> When comparing the '*' to Y[2] we think of both ranges as '-'. On >>>> the >>>> '-' moving from 3' to 5' is right to left so the 5 follows 7 >>>>>>> >>>>>>> follow(x, Y[2]) >>>>>>> [1] 1 >>>>>>> >>>>>>> but does not precede it >>>>>>> >>>>>>> precede(x, Y[2]) >>>>>>> [1] NA >>>>>>> >>>>>>> When we set ignore.strand=TRUE, all ranges are considered '+'. >>>>>>> >>>>>>> Valerie >>>>>>> whereas >>>>>>> >>>>>>> follow(x, Y, ignore.strand=TRUE) # correct, I think >>>>>>> [1] NA >>>>>>> >>>>>>> sessionInfo() >>>>>>> R Under development (unstable) (2012-10-10 r60908) >>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>> >>>>>>> locale: >>>>>>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C >>>>>>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 >>>>>>> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 >>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C >>>>>>> >>>>>>> attached base packages: >>>>>>> [1] stats graphics grDevices datasets utils methods base >>>>>>> >>>>>>> other attached packages: >>>>>>> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 >>>>>>> >>>>>>> loaded via a namespace (and not attached): >>>>>>> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> -- >>>>> Hervé Pagès >>>>> >>>>> Program in Computational Biology >>>>> Division of Public Health Sciences >>>>> Fred Hutchinson Cancer Research Center >>>>> 1100 Fairview Ave. N, M1-B514 >>>>> P.O. Box 19024 >>>>> Seattle, WA 98109-1024 >>>>> >>>>> E-mail: hpages at fhcrc.org >>>>> Phone: (206) 667-5791 >>>>> Fax: (206) 667-1319 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >>> >>> -- >>> *A model is a lie that helps you see the truth.* >>> * >>> * >>> Howard >>> Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >>> >>> [[alternative HTML version deleted]] >>> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
Thanks Herve for taking care of compare(). Harris, you mentioned a 'signed' option for distance(). Tim is working on a patch for distanceToNearest() for GRanges that will return signed distances. The better place to make this change would be at the distance() level. The two of you may want to join forces on this? I think we now agree on the following, (1) distance() should return 0 when zero-width ranges are compared. distance(IRanges(2,1), IRanges(2,1)) ## should return 0 (2) distance() between a zero-width and non-zero-width range will be computed as always, i.e., Herve's examples hold where all distances are equal to 8, distance(IRanges(1,2), IRanges(10,16)) distance(IRanges(10,16), IRanges(1,2)) distance(IRanges(1,2), IRanges(10,9)) distance(IRanges(10,9),IRanges(1,2)) distance(IRanges(3,2), IRanges(10,16)) distance(IRanges(10,16), IRanges(3,2)) distance(IRanges(3,2), IRanges(10,9)) distance(IRanges(10,9), IRanges(3,2)) I will make the necessary changes to distance(). Thanks for the input. Valerie On 10/17/2012 10:41 AM, Hervé Pagès wrote: > Hi, > > On 10/17/2012 09:22 AM, Harris A. Jaffee wrote: >> On Oct 17, 2012, at 11:31 AM, Cook, Malcolm wrote: >> >>> Indeed, for distance to behave as a metric it must be that >>> distance(x,x) >>> == 0 for all x >>> >>> Here is some further (consistent) oddness: >>> >>>> compare(GRanges('x',IRanges(4,3)),GRanges('x',IRanges(4,3))) >>> [1] -5 >> >> Thanks for the tip on compare. Looks like a nice classification scheme. >> I should start reading vignettes! >> >>> I would expect the value to be 0 here. Comparing x with itself should >>> yield 0. After all they are identical! >> >> Yeah, identical but *empty*. Therefore, none of the scenarios depicted >> in the man page seems appropriate (even 0/g): >> >> -6 a: x[i]: .oooo....... 6 m: x[i]: >> .......oooo. >> y[i]: .......oooo. y[i]: >> .oooo....... >> >> -5 b: x[i]: ..oooo...... 5 l: x[i]: >> ......oooo.. >> y[i]: ......oooo.. y[i]: >> ..oooo...... >> >> -4 c: x[i]: ...oooo..... 4 k: x[i]: >> .....oooo... >> y[i]: .....oooo... y[i]: >> ...oooo..... >> >> -3 d: x[i]: ...oooooo... 3 j: x[i]: >> .....oooo... >> y[i]: .....oooo... y[i]: >> ...oooooo... >> >> -2 e: x[i]: ..oooooooo.. 2 i: x[i]: >> ....oooo.... >> y[i]: ....oooo.... y[i]: >> ..oooooooo.. >> >> -1 f: x[i]: ...oooo..... 1 h: x[i]: >> ...oooooo... >> y[i]: ...oooooo... y[i]: >> ...oooo..... >> >> 0 g: x[i]: ...oooooo... >> y[i]: ...oooooo... >> >> >> Again, I like NA in these cases. > > compare() should be consistent with the comparison binary operators > (==, !=, <=, <, >=, >), in the sense that we should have: > > x == y <=> compare(x, y) == 0 > x != y <=> compare(x, y) != 0 > x <= y <=> compare(x, y) <= 0 > etc... > > Right now those operations are actually using compare() behind the > scene when the operands are GRanges objects: > > > gr <- GRanges('x',IRanges(4,3)) > > gr == gr > [1] FALSE > > So compare(x, x) needs to be fixed to always return 0. > > Note that the comparison binary operators on Ranges objects are > not based on compare() yet so even though compare(ir, ir) is not > 0 here, we get a TRUE for ir == ir: > > > compare(ir, ir) > [1] -5 > > ir == ir > [1] TRUE > > I have on my list to change the current implementation of the > comparison binary operators on Ranges objects so they also use > compare() behind the scene (that will make them about twice > faster because compare() is implemented in C). > > I'll fix compare() right away. Thanks for the feedback! > > H. > > >> >> >>> --Malcolm >>> >>> >>> On 10/17/12 9:50 AM, "Tim Triche, Jr." <tim.triche at="" gmail.com=""> wrote: >>> >>>> Hmm, that's odd. And it seems inconsistent with the usual >>>> distance() and >>>> distanceToNearest() methods, which would return 0 in this case, yes? >>>> >>>> e.g. if the IRanges were GRanges and sitting on top of each other >>>> on the >>>> same strand, or if the first was distance()'ed against itself (the >>>> same >>>> thing) >>>> >>>> That doesn't make sense to me either. This doesn't seem like an >>>> issue of >>>> theology to me, but rather inconsistency with what the docs say will >>>> happen. >>>> >>>> >>>> >>>> On Wed, Oct 17, 2012 at 7:31 AM, Harris A. Jaffee <hj at="" jhu.edu=""> wrote: >>>> >>>>> As I said, any rule is fine, especially if they can serve a purpose >>>>> via their "location". But you also have this, which seems weird: >>>>> >>>>>> distance(IRanges(2,1), IRanges(2,1)) >>>>> [1] 1 >>>>> >>>>> To say they are not treated any differently is a big stretch to me, >>>>> since their defining formula (end = start-1) is already a violation >>>>> of my senses. [I wonder if SEW = (start, NA, 0) wouldn't have been >>>>> more useful.] >>>>> >>>>> Just so we don't continue off into useless theology, I should say >>>>> that >>>>> I started looking into the distance code with the idea of adding a >>>>> "signed" option (say, distance(x,y)<0 iff x is downstream from y), >>>>> and >>>>> I wanted to understand the landscape first. >>>>> >>>>> On Oct 17, 2012, at 12:56 AM, Hervé Pagès wrote: >>>>> >>>>>> Hi Harris, Val, Michael, >>>>>> >>>>>> Right now, all those distances are equal to 8, which I think is what >>>>>> we want: >>>>>> >>>>>> distance(IRanges(1,2), IRanges(10,16)) >>>>>> distance(IRanges(10,16), IRanges(1,2)) >>>>>> distance(IRanges(1,2), IRanges(10,9)) >>>>>> distance(IRanges(10,9),IRanges(1,2)) >>>>>> distance(IRanges(3,2), IRanges(10,16)) >>>>>> distance(IRanges(10,16), IRanges(3,2)) >>>>>> distance(IRanges(3,2), IRanges(10,9)) >>>>>> distance(IRanges(10,9), IRanges(3,2)) >>>>>> >>>>>> Yes, from a mathematical point of view, the distance between a >>>>>> non-empty set and an empty set if not defined, but I'm not sure >>>>>> there >>>>>> would be much to gain in doing this. Some use cases would be needed. >>>>>> What's nice about the current behavior is that zero-width ranges are >>>>>> not treated any differently than other ranges. >>>>>> >>>>>> H. >>>>>> >>>>>> >>>>>> On 10/16/2012 12:21 PM, Harris A. Jaffee wrote: >>>>>>> Any rule is fine with me. They would not occur for us except by >>>>> mistake. >>>>>>> I was basically raising an alert that the current situation is not >>>>> quite >>>>>>> complete. >>>>>>> >>>>>>> On Oct 16, 2012, at 2:53 PM, Michael Lawrence wrote: >>>>>>> >>>>>>>> Since these zero-width ranges do have a position, I'm not sure why >>>>> we >>>>> cannot calculate their distance. As long as we have an established >>>>> rule >>>>> about how we handle them. For example, is it just from the start? >>>>> That's >>>>> probably most intuitive. >>>>>>>> >>>>>>>> Michael >>>>>>>> >>>>>>>> On Tue, Oct 16, 2012 at 11:25 AM, Valerie Obenchain < >>>>> vobencha at fhcrc.org> wrote: >>>>>>>> On 10/16/2012 09:25 AM, Harris A. Jaffee wrote: >>>>>>>> Fantastic, thank you. I'm not sure I knew about that page! >>>>>>>> >>>>>>>> For example, ?precede gives me the IRanges nearest-methods page, >>>>>>>> and only if I explicitly request ?'nearest-methods' can I choose >>>>>>>> the GenomicRanges page. That brings up another issue: ?distance >>>>>>>> again sends me to the IRanges nearest-methods help, and that only >>>>>>>> documents distanceToNearest, which is not enough. The spec for >>>>>>>> distance actually lives only on >>>>>>>> ?'precede,GenomicRanges,GenomicRanges-method', as far as I see. >>>>>>>> >>>>>>>> I can update the IRanges man page to include distance(). >>>>>>>> >>>>>>>> Finally, and admittedly a total edge case, but zero-width ranges >>>>>>>> appear to be mishandled, or not handled, by the IRanges distance >>>>>>>> code. I think they should be infinitely far from anything, or at >>>>>>>> least NA, and there should not be any nearest or distanceToNearest >>>>>>>> value for a zero-width query. >>>>>>>> cc'ing Herve and Michael for history. >>>>>>>> >>>>>>>> In general I agree that zero-width ranges should not have a >>>>>>>> distance >>>>> from any other range. I favor returning NA instead of inf since the >>>>> range >>>>> (albeit zero-width) does have a location. >>>>>>>> >>>>>>>> One concern I have is that we use zero-width ranges to represent >>>>> insertions. In that context we may want to compute the distance to >>>>> the >>>>> nearest insertion. I do not have a concrete use case - just >>>>> thinking of >>>>> what might come. I'd be interested in opinions on this. >>>>>>>> >>>>>>>> Valerie >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> x >>>>>>>> IRanges of length 1 >>>>>>>> start end width >>>>>>>> [1] 2 1 0 >>>>>>>> >>>>>>>> distance(x, x) >>>>>>>> [1] 1 >>>>>>>> >>>>>>>> y >>>>>>>> IRanges of length 1 >>>>>>>> start end width >>>>>>>> [1] 3 2 0 >>>>>>>> >>>>>>>> distance(x, y) >>>>>>>> [1] 2 >>>>>>>> >>>>>>>> z >>>>>>>> IRanges of length 1 >>>>>>>> start end width >>>>>>>> [1] 3 3 1 >>>>>>>> >>>>>>>> distance(x, z) >>>>>>>> [1] 2 >>>>>>>> >>>>>>>> On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote: >>>>>>>> >>>>>>>> A few follow up items from an off-list conversation with Harris. >>>>>>>> >>>>>>>> 1) I reversed the use of 5' and3' in my response below. Sorry >>>>>>>> about >>>>> that. >>>>>>>> >>>>>>>> 2) The primary landing page for ?precede is in IRanges. I've >>>>> expanded >>>>> the \seealso section of this page to point the user to the man >>>>> page in >>>>> GenomicRanges. >>>>>>>> >>>>>>>> 3) The precede/follow man page in GenomicRanges now specifically >>>>> states that 5' to 3' is the relevant orientation for >>>>> precede/follow. For >>>>> those interested in taking a look, the man page has several aliases, >>>>>>>> >>>>>>>> ?'nearest-methods' >>>>>>>> ?'precede,GenomicRanges,GenomicRanges-method' >>>>>>>> ?'follow,GenomicRanges,GenomicRanges-method' >>>>>>>> >>>>>>>> Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4. >>>>>>>> >>>>>>>> Thanks Harris. >>>>>>>> >>>>>>>> Valerie >>>>>>>> >>>>>>>> On 10/12/2012 04:17 PM, Valerie Obenchain wrote: >>>>>>>> Hi Harris, >>>>>>>> >>>>>>>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: >>>>>>>> Apologies in advance if I just don't understand the ignore.strand >>>>> switch, or >>>>>>>> perhaps GRanges objects. Also, I have not even tried to >>>>>>>> understand >>>>>>>> >>>>>>>> GenomicRanges:::.GenomicRanges_findPrecedeFollow >>>>>>>> >>>>>>>> I have the impression that a query with strand entirely "*" >>>>> essentially implies >>>>>>>> ignore.strand, but this case here is handled correctly only if >>>>> ignore.strand is >>>>>>>> TRUE (consistent with distanceToNearest but not distance, which >>>>> seems >>>>> correct): >>>>>>>> >>>>>>>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", >>>>> strand="*") >>>>>>>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), >>>>> seqnames="chr1", >>>>> strand=c("+","-")) >>>>>>>> distance(x, Y[1]) >>>>>>>> [1] 1 >>>>>>>> distance(x, Y[2]) >>>>>>>> [1] 2 >>>>>>>> >>>>>>>> nearest(x, Y, ignore.strand=TRUE) # correct >>>>>>>> [1] 1 >>>>>>>> nearest(ranges(x), ranges(Y)) # also correct >>>>>>>> [1] 1 >>>>>>>> >>>>>>>> However, >>>>>>>> >>>>>>>> nearest(x, Y) >>>>>>>> [1] 2 >>>>>>>> Thanks for reporting this bug. Now fixed in GenomicRanges >>>>>>>> 1.11.3 and >>>>> 1.10.2. >>>>>>>> >>>>>>>> The problem was in .nearest() where I computed the preceding and >>>>> following distance of the ranges. I was missing abs() so the distance >>>>> for >>>>> the second range was -2 instead of 2. When checking to see which >>>>> range >>>>> was >>>>> closest >>>>>>>> >>>>>>>> 1< -2 >>>>>>>> [1] FALSE >>>>>>>> >>>>>>>> so the second range won. >>>>>>>> distanceToNearest(x, Y) >>>>>>>> DataFrame with 1 row and 3 columns >>>>>>>> queryHits subjectHits distance >>>>>>>> <integer> <integer> <integer> >>>>>>>> 1 1 2 2 >>>>>>>> >>>>>>>> distanceToNearest(x, Y, ignore.strand=TRUE) >>>>>>>> DataFrame with 1 row and 3 columns >>>>>>>> queryHits subjectHits distance >>>>>>>> <integer> <integer> <integer> >>>>>>>> 1 1 1 1 >>>>>>>> >>>>>>>> >>>>>>>> Finally, >>>>>>>> >>>>>>>> follow(ranges(x), ranges(Y)) >>>>>>>> [1] NA >>>>>>>> >>>>>>>> The issue (along with GenomicRanges:::.nearest) must come down to >>>>> this: >>>>>>>> >>>>>>>> follow(x, Y) # how can this be? >>>>>>>> [1] 2 >>>>>>>> This behavior is due to the fact that a '*' range can compare >>>>>>>> itself >>>>> to either '+' or '-'. The 'x' is '*" located at position 5. When >>>>> comparing it to the '+' Y[1] >>>>>>>> we think of both ranges as '+'. precede() and follow() >>>>>>>> locations are >>>>> determined by moving from 3' to 5'. On '+' strand this is from >>>>> left to >>>>> right so 5 precedes 6. >>>>>>>> >>>>>>>> precede(x, Y[1]) >>>>>>>> [1] 1 >>>>>>>> >>>>>>>> but does not follow it >>>>>>>> >>>>>>>> follow(x, Y[1]) >>>>>>>> [1] NA >>>>>>>> >>>>>>>> When comparing the '*' to Y[2] we think of both ranges as '-'. On >>>>> the >>>>> '-' moving from 3' to 5' is right to left so the 5 follows 7 >>>>>>>> >>>>>>>> follow(x, Y[2]) >>>>>>>> [1] 1 >>>>>>>> >>>>>>>> but does not precede it >>>>>>>> >>>>>>>> precede(x, Y[2]) >>>>>>>> [1] NA >>>>>>>> >>>>>>>> When we set ignore.strand=TRUE, all ranges are considered '+'. >>>>>>>> >>>>>>>> Valerie >>>>>>>> whereas >>>>>>>> >>>>>>>> follow(x, Y, ignore.strand=TRUE) # correct, I think >>>>>>>> [1] NA >>>>>>>> >>>>>>>> sessionInfo() >>>>>>>> R Under development (unstable) (2012-10-10 r60908) >>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>> >>>>>>>> locale: >>>>>>>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C >>>>>>>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 >>>>>>>> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 >>>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C >>>>>>>> >>>>>>>> attached base packages: >>>>>>>> [1] stats graphics grDevices datasets utils methods >>>>>>>> base >>>>>>>> >>>>>>>> other attached packages: >>>>>>>> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 >>>>>>>> >>>>>>>> loaded via a namespace (and not attached): >>>>>>>> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> Hervé Pagès >>>>>> >>>>>> Program in Computational Biology >>>>>> Division of Public Health Sciences >>>>>> Fred Hutchinson Cancer Research Center >>>>>> 1100 Fairview Ave. N, M1-B514 >>>>>> P.O. Box 19024 >>>>>> Seattle, WA 98109-1024 >>>>>> >>>>>> E-mail: hpages at fhcrc.org >>>>>> Phone: (206) 667-5791 >>>>>> Fax: (206) 667-1319 >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>>> >>>> >>>> -- >>>> *A model is a lie that helps you see the truth.* >>>> * >>>> * >>>> Howard >>>> Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >>>> >>>> [[alternative HTML version deleted]] >>>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >
ADD REPLY
0
Entering edit mode
On 10/17/2012 04:02 PM, Valerie Obenchain wrote: > Thanks Herve for taking care of compare(). > > Harris, you mentioned a 'signed' option for distance(). Tim is working > on a patch for distanceToNearest() for GRanges that will return signed > distances. The better place to make this change would be at the > distance() level. To be clear, I meant modify distance,Ranges,Ranges then propagate to distance,GenomicRanges,GenomicRanges. Valerie
ADD REPLY
0
Entering edit mode
The behavior of distance() has been changed in IRanges 1.7.17 and propagated to GenomicRanges 1.11.5. To handle zero-width ranges in a consistent and (hopefully) intuitive way we've gone with a 'block' model representation. A range is made up of blocks of size 1. The blocks are adjacent to each other and there is no gap between them. With this model the distance between 2 consecutive blocks is 0L (previously it was 1L). The distance function now returns the number of gaps between the blocks. This is all explained on the ?distance man page. If it is not clear please let me know. Highlights : - distance() now returns 1 less than what it did in <= IRanges 1.7.16. - Adjacent or overlapping ranges return a distance of 0L. - A temporary warning is emitted from distance,Ranges,Ranges which will be removed when Bioconductor 2.13 is released. - distance(x, y) now allows 'x' and 'y' to have different lengths (shortest recycled to length of longest). Valerie On 10/17/2012 04:02 PM, Valerie Obenchain wrote: > Thanks Herve for taking care of compare(). > > Harris, you mentioned a 'signed' option for distance(). Tim is working > on a patch for distanceToNearest() for GRanges that will return signed > distances. The better place to make this change would be at the > distance() level. The two of you may want to join forces on this? > > I think we now agree on the following, > > (1) distance() should return 0 when zero-width ranges are compared. > > distance(IRanges(2,1), IRanges(2,1)) ## should return 0 > > (2) distance() between a zero-width and non-zero-width range will be > computed as always, i.e., Herve's examples hold where all distances > are equal to 8, > > distance(IRanges(1,2), IRanges(10,16)) > distance(IRanges(10,16), IRanges(1,2)) > distance(IRanges(1,2), IRanges(10,9)) > distance(IRanges(10,9),IRanges(1,2)) > distance(IRanges(3,2), IRanges(10,16)) > distance(IRanges(10,16), IRanges(3,2)) > distance(IRanges(3,2), IRanges(10,9)) > distance(IRanges(10,9), IRanges(3,2)) > > I will make the necessary changes to distance(). Thanks for the input. > > Valerie > > On 10/17/2012 10:41 AM, Hervé Pagès wrote: >> Hi, >> >> On 10/17/2012 09:22 AM, Harris A. Jaffee wrote: >>> On Oct 17, 2012, at 11:31 AM, Cook, Malcolm wrote: >>> >>>> Indeed, for distance to behave as a metric it must be that >>>> distance(x,x) >>>> == 0 for all x >>>> >>>> Here is some further (consistent) oddness: >>>> >>>>> compare(GRanges('x',IRanges(4,3)),GRanges('x',IRanges(4,3))) >>>> [1] -5 >>> >>> Thanks for the tip on compare. Looks like a nice classification >>> scheme. >>> I should start reading vignettes! >>> >>>> I would expect the value to be 0 here. Comparing x with itself should >>>> yield 0. After all they are identical! >>> >>> Yeah, identical but *empty*. Therefore, none of the scenarios depicted >>> in the man page seems appropriate (even 0/g): >>> >>> -6 a: x[i]: .oooo....... 6 m: x[i]: >>> .......oooo. >>> y[i]: .......oooo. y[i]: >>> .oooo....... >>> >>> -5 b: x[i]: ..oooo...... 5 l: x[i]: >>> ......oooo.. >>> y[i]: ......oooo.. y[i]: >>> ..oooo...... >>> >>> -4 c: x[i]: ...oooo..... 4 k: x[i]: >>> .....oooo... >>> y[i]: .....oooo... y[i]: >>> ...oooo..... >>> >>> -3 d: x[i]: ...oooooo... 3 j: x[i]: >>> .....oooo... >>> y[i]: .....oooo... y[i]: >>> ...oooooo... >>> >>> -2 e: x[i]: ..oooooooo.. 2 i: x[i]: >>> ....oooo.... >>> y[i]: ....oooo.... y[i]: >>> ..oooooooo.. >>> >>> -1 f: x[i]: ...oooo..... 1 h: x[i]: >>> ...oooooo... >>> y[i]: ...oooooo... y[i]: >>> ...oooo..... >>> >>> 0 g: x[i]: ...oooooo... >>> y[i]: ...oooooo... >>> >>> >>> Again, I like NA in these cases. >> >> compare() should be consistent with the comparison binary operators >> (==, !=, <=, <, >=, >), in the sense that we should have: >> >> x == y <=> compare(x, y) == 0 >> x != y <=> compare(x, y) != 0 >> x <= y <=> compare(x, y) <= 0 >> etc... >> >> Right now those operations are actually using compare() behind the >> scene when the operands are GRanges objects: >> >> > gr <- GRanges('x',IRanges(4,3)) >> > gr == gr >> [1] FALSE >> >> So compare(x, x) needs to be fixed to always return 0. >> >> Note that the comparison binary operators on Ranges objects are >> not based on compare() yet so even though compare(ir, ir) is not >> 0 here, we get a TRUE for ir == ir: >> >> > compare(ir, ir) >> [1] -5 >> > ir == ir >> [1] TRUE >> >> I have on my list to change the current implementation of the >> comparison binary operators on Ranges objects so they also use >> compare() behind the scene (that will make them about twice >> faster because compare() is implemented in C). >> >> I'll fix compare() right away. Thanks for the feedback! >> >> H. >> >> >>> >>> >>>> --Malcolm >>>> >>>> >>>> On 10/17/12 9:50 AM, "Tim Triche, Jr." <tim.triche at="" gmail.com=""> wrote: >>>> >>>>> Hmm, that's odd. And it seems inconsistent with the usual >>>>> distance() and >>>>> distanceToNearest() methods, which would return 0 in this case, yes? >>>>> >>>>> e.g. if the IRanges were GRanges and sitting on top of each other >>>>> on the >>>>> same strand, or if the first was distance()'ed against itself (the >>>>> same >>>>> thing) >>>>> >>>>> That doesn't make sense to me either. This doesn't seem like an >>>>> issue of >>>>> theology to me, but rather inconsistency with what the docs say will >>>>> happen. >>>>> >>>>> >>>>> >>>>> On Wed, Oct 17, 2012 at 7:31 AM, Harris A. Jaffee <hj at="" jhu.edu=""> wrote: >>>>> >>>>>> As I said, any rule is fine, especially if they can serve a purpose >>>>>> via their "location". But you also have this, which seems weird: >>>>>> >>>>>>> distance(IRanges(2,1), IRanges(2,1)) >>>>>> [1] 1 >>>>>> >>>>>> To say they are not treated any differently is a big stretch to me, >>>>>> since their defining formula (end = start-1) is already a violation >>>>>> of my senses. [I wonder if SEW = (start, NA, 0) wouldn't have been >>>>>> more useful.] >>>>>> >>>>>> Just so we don't continue off into useless theology, I should say >>>>>> that >>>>>> I started looking into the distance code with the idea of adding a >>>>>> "signed" option (say, distance(x,y)<0 iff x is downstream from >>>>>> y), and >>>>>> I wanted to understand the landscape first. >>>>>> >>>>>> On Oct 17, 2012, at 12:56 AM, Hervé Pagès wrote: >>>>>> >>>>>>> Hi Harris, Val, Michael, >>>>>>> >>>>>>> Right now, all those distances are equal to 8, which I think is >>>>>>> what >>>>>>> we want: >>>>>>> >>>>>>> distance(IRanges(1,2), IRanges(10,16)) >>>>>>> distance(IRanges(10,16), IRanges(1,2)) >>>>>>> distance(IRanges(1,2), IRanges(10,9)) >>>>>>> distance(IRanges(10,9),IRanges(1,2)) >>>>>>> distance(IRanges(3,2), IRanges(10,16)) >>>>>>> distance(IRanges(10,16), IRanges(3,2)) >>>>>>> distance(IRanges(3,2), IRanges(10,9)) >>>>>>> distance(IRanges(10,9), IRanges(3,2)) >>>>>>> >>>>>>> Yes, from a mathematical point of view, the distance between a >>>>>>> non-empty set and an empty set if not defined, but I'm not sure >>>>>>> there >>>>>>> would be much to gain in doing this. Some use cases would be >>>>>>> needed. >>>>>>> What's nice about the current behavior is that zero-width ranges >>>>>>> are >>>>>>> not treated any differently than other ranges. >>>>>>> >>>>>>> H. >>>>>>> >>>>>>> >>>>>>> On 10/16/2012 12:21 PM, Harris A. Jaffee wrote: >>>>>>>> Any rule is fine with me. They would not occur for us except by >>>>>> mistake. >>>>>>>> I was basically raising an alert that the current situation is not >>>>>> quite >>>>>>>> complete. >>>>>>>> >>>>>>>> On Oct 16, 2012, at 2:53 PM, Michael Lawrence wrote: >>>>>>>> >>>>>>>>> Since these zero-width ranges do have a position, I'm not sure >>>>>>>>> why >>>>>> we >>>>>> cannot calculate their distance. As long as we have an >>>>>> established rule >>>>>> about how we handle them. For example, is it just from the start? >>>>>> That's >>>>>> probably most intuitive. >>>>>>>>> >>>>>>>>> Michael >>>>>>>>> >>>>>>>>> On Tue, Oct 16, 2012 at 11:25 AM, Valerie Obenchain < >>>>>> vobencha at fhcrc.org> wrote: >>>>>>>>> On 10/16/2012 09:25 AM, Harris A. Jaffee wrote: >>>>>>>>> Fantastic, thank you. I'm not sure I knew about that page! >>>>>>>>> >>>>>>>>> For example, ?precede gives me the IRanges nearest-methods page, >>>>>>>>> and only if I explicitly request ?'nearest-methods' can I choose >>>>>>>>> the GenomicRanges page. That brings up another issue: ?distance >>>>>>>>> again sends me to the IRanges nearest-methods help, and that only >>>>>>>>> documents distanceToNearest, which is not enough. The spec for >>>>>>>>> distance actually lives only on >>>>>>>>> ?'precede,GenomicRanges,GenomicRanges-method', as far as I see. >>>>>>>>> >>>>>>>>> I can update the IRanges man page to include distance(). >>>>>>>>> >>>>>>>>> Finally, and admittedly a total edge case, but zero-width ranges >>>>>>>>> appear to be mishandled, or not handled, by the IRanges distance >>>>>>>>> code. I think they should be infinitely far from anything, or at >>>>>>>>> least NA, and there should not be any nearest or >>>>>>>>> distanceToNearest >>>>>>>>> value for a zero-width query. >>>>>>>>> cc'ing Herve and Michael for history. >>>>>>>>> >>>>>>>>> In general I agree that zero-width ranges should not have a >>>>>>>>> distance >>>>>> from any other range. I favor returning NA instead of inf since the >>>>>> range >>>>>> (albeit zero-width) does have a location. >>>>>>>>> >>>>>>>>> One concern I have is that we use zero-width ranges to represent >>>>>> insertions. In that context we may want to compute the distance >>>>>> to the >>>>>> nearest insertion. I do not have a concrete use case - just >>>>>> thinking of >>>>>> what might come. I'd be interested in opinions on this. >>>>>>>>> >>>>>>>>> Valerie >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> x >>>>>>>>> IRanges of length 1 >>>>>>>>> start end width >>>>>>>>> [1] 2 1 0 >>>>>>>>> >>>>>>>>> distance(x, x) >>>>>>>>> [1] 1 >>>>>>>>> >>>>>>>>> y >>>>>>>>> IRanges of length 1 >>>>>>>>> start end width >>>>>>>>> [1] 3 2 0 >>>>>>>>> >>>>>>>>> distance(x, y) >>>>>>>>> [1] 2 >>>>>>>>> >>>>>>>>> z >>>>>>>>> IRanges of length 1 >>>>>>>>> start end width >>>>>>>>> [1] 3 3 1 >>>>>>>>> >>>>>>>>> distance(x, z) >>>>>>>>> [1] 2 >>>>>>>>> >>>>>>>>> On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote: >>>>>>>>> >>>>>>>>> A few follow up items from an off-list conversation with Harris. >>>>>>>>> >>>>>>>>> 1) I reversed the use of 5' and3' in my response below. Sorry >>>>>>>>> about >>>>>> that. >>>>>>>>> >>>>>>>>> 2) The primary landing page for ?precede is in IRanges. I've >>>>>> expanded >>>>>> the \seealso section of this page to point the user to the man >>>>>> page in >>>>>> GenomicRanges. >>>>>>>>> >>>>>>>>> 3) The precede/follow man page in GenomicRanges now specifically >>>>>> states that 5' to 3' is the relevant orientation for >>>>>> precede/follow. For >>>>>> those interested in taking a look, the man page has several aliases, >>>>>>>>> >>>>>>>>> ?'nearest-methods' >>>>>>>>> ?'precede,GenomicRanges,GenomicRanges-method' >>>>>>>>> ?'follow,GenomicRanges,GenomicRanges-method' >>>>>>>>> >>>>>>>>> Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4. >>>>>>>>> >>>>>>>>> Thanks Harris. >>>>>>>>> >>>>>>>>> Valerie >>>>>>>>> >>>>>>>>> On 10/12/2012 04:17 PM, Valerie Obenchain wrote: >>>>>>>>> Hi Harris, >>>>>>>>> >>>>>>>>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: >>>>>>>>> Apologies in advance if I just don't understand the ignore.strand >>>>>> switch, or >>>>>>>>> perhaps GRanges objects. Also, I have not even tried to >>>>>>>>> understand >>>>>>>>> >>>>>>>>> GenomicRanges:::.GenomicRanges_findPrecedeFollow >>>>>>>>> >>>>>>>>> I have the impression that a query with strand entirely "*" >>>>>> essentially implies >>>>>>>>> ignore.strand, but this case here is handled correctly only if >>>>>> ignore.strand is >>>>>>>>> TRUE (consistent with distanceToNearest but not distance, which >>>>>> seems >>>>>> correct): >>>>>>>>> >>>>>>>>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", >>>>>> strand="*") >>>>>>>>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), >>>>>> seqnames="chr1", >>>>>> strand=c("+","-")) >>>>>>>>> distance(x, Y[1]) >>>>>>>>> [1] 1 >>>>>>>>> distance(x, Y[2]) >>>>>>>>> [1] 2 >>>>>>>>> >>>>>>>>> nearest(x, Y, ignore.strand=TRUE) # correct >>>>>>>>> [1] 1 >>>>>>>>> nearest(ranges(x), ranges(Y)) # also correct >>>>>>>>> [1] 1 >>>>>>>>> >>>>>>>>> However, >>>>>>>>> >>>>>>>>> nearest(x, Y) >>>>>>>>> [1] 2 >>>>>>>>> Thanks for reporting this bug. Now fixed in GenomicRanges >>>>>>>>> 1.11.3 and >>>>>> 1.10.2. >>>>>>>>> >>>>>>>>> The problem was in .nearest() where I computed the preceding and >>>>>> following distance of the ranges. I was missing abs() so the >>>>>> distance >>>>>> for >>>>>> the second range was -2 instead of 2. When checking to see which >>>>>> range >>>>>> was >>>>>> closest >>>>>>>>> >>>>>>>>> 1< -2 >>>>>>>>> [1] FALSE >>>>>>>>> >>>>>>>>> so the second range won. >>>>>>>>> distanceToNearest(x, Y) >>>>>>>>> DataFrame with 1 row and 3 columns >>>>>>>>> queryHits subjectHits distance >>>>>>>>> <integer> <integer> <integer> >>>>>>>>> 1 1 2 2 >>>>>>>>> >>>>>>>>> distanceToNearest(x, Y, ignore.strand=TRUE) >>>>>>>>> DataFrame with 1 row and 3 columns >>>>>>>>> queryHits subjectHits distance >>>>>>>>> <integer> <integer> <integer> >>>>>>>>> 1 1 1 1 >>>>>>>>> >>>>>>>>> >>>>>>>>> Finally, >>>>>>>>> >>>>>>>>> follow(ranges(x), ranges(Y)) >>>>>>>>> [1] NA >>>>>>>>> >>>>>>>>> The issue (along with GenomicRanges:::.nearest) must come down to >>>>>> this: >>>>>>>>> >>>>>>>>> follow(x, Y) # how can this be? >>>>>>>>> [1] 2 >>>>>>>>> This behavior is due to the fact that a '*' range can compare >>>>>>>>> itself >>>>>> to either '+' or '-'. The 'x' is '*" located at position 5. When >>>>>> comparing it to the '+' Y[1] >>>>>>>>> we think of both ranges as '+'. precede() and follow() >>>>>>>>> locations are >>>>>> determined by moving from 3' to 5'. On '+' strand this is from >>>>>> left to >>>>>> right so 5 precedes 6. >>>>>>>>> >>>>>>>>> precede(x, Y[1]) >>>>>>>>> [1] 1 >>>>>>>>> >>>>>>>>> but does not follow it >>>>>>>>> >>>>>>>>> follow(x, Y[1]) >>>>>>>>> [1] NA >>>>>>>>> >>>>>>>>> When comparing the '*' to Y[2] we think of both ranges as '-'. On >>>>>> the >>>>>> '-' moving from 3' to 5' is right to left so the 5 follows 7 >>>>>>>>> >>>>>>>>> follow(x, Y[2]) >>>>>>>>> [1] 1 >>>>>>>>> >>>>>>>>> but does not precede it >>>>>>>>> >>>>>>>>> precede(x, Y[2]) >>>>>>>>> [1] NA >>>>>>>>> >>>>>>>>> When we set ignore.strand=TRUE, all ranges are considered '+'. >>>>>>>>> >>>>>>>>> Valerie >>>>>>>>> whereas >>>>>>>>> >>>>>>>>> follow(x, Y, ignore.strand=TRUE) # correct, I think >>>>>>>>> [1] NA >>>>>>>>> >>>>>>>>> sessionInfo() >>>>>>>>> R Under development (unstable) (2012-10-10 r60908) >>>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>>> >>>>>>>>> locale: >>>>>>>>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C >>>>>>>>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 >>>>>>>>> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 >>>>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C >>>>>>>>> >>>>>>>>> attached base packages: >>>>>>>>> [1] stats graphics grDevices datasets utils >>>>>>>>> methods base >>>>>>>>> >>>>>>>>> other attached packages: >>>>>>>>> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 >>>>>>>>> >>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Hervé Pagès >>>>>>> >>>>>>> Program in Computational Biology >>>>>>> Division of Public Health Sciences >>>>>>> Fred Hutchinson Cancer Research Center >>>>>>> 1100 Fairview Ave. N, M1-B514 >>>>>>> P.O. Box 19024 >>>>>>> Seattle, WA 98109-1024 >>>>>>> >>>>>>> E-mail: hpages at fhcrc.org >>>>>>> Phone: (206) 667-5791 >>>>>>> Fax: (206) 667-1319 >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> *A model is a lie that helps you see the truth.* >>>>> * >>>>> * >>>>> Howard >>>>> Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
We have probably overloaded the subject of this thread! I have to say I don't get it. Moreover, I believe it conflicts with the definition of IRanges, which (aside from empty ones, which for all I care can be relegated to a different class) has always been, to me, any one of: 1) the closed interval [start, end] 2) the vector start:end, as a set 3) the two numbers 'start' and 'end' distance(x, y) always meant the minimum absolute difference between the sets for x and y. This was/is intuitive and useful. If I understand the new picture, you are equating an IRanges (non- empty ones anyway) with the closed interval [start - .5, end + .5] and then applying to that the same minimum absolute difference spec. Why? Maybe you have made 'width' (start-end+1) more natural as the size of your new interval, or the number of blocks, but it was already the length of the vector start:end. If this change was because of something I said, I must have said it badly. On Oct 22, 2012, at 7:40 PM, Valerie Obenchain wrote: > The behavior of distance() has been changed in IRanges 1.7.17 and propagated to GenomicRanges 1.11.5. > > To handle zero-width ranges in a consistent and (hopefully) intuitive way we've gone with a 'block' model representation. A range is made up of blocks of size 1. The blocks are adjacent to each other and there is no gap between them. With this model the distance between 2 consecutive blocks is 0L (previously it was 1L). The distance function now returns the number of gaps between the blocks. This is all explained on the ?distance man page. If it is not clear please let me know. > > Highlights : > - distance() now returns 1 less than what it did in <= IRanges 1.7.16. > - Adjacent or overlapping ranges return a distance of 0L. > - A temporary warning is emitted from distance,Ranges,Ranges which will be removed when Bioconductor 2.13 is released. > - distance(x, y) now allows 'x' and 'y' to have different lengths (shortest recycled to length of longest). > > Valerie > > On 10/17/2012 04:02 PM, Valerie Obenchain wrote: >> Thanks Herve for taking care of compare(). >> >> Harris, you mentioned a 'signed' option for distance(). Tim is working on a patch for distanceToNearest() for GRanges that will return signed distances. The better place to make this change would be at the distance() level. The two of you may want to join forces on this? >> >> I think we now agree on the following, >> >> (1) distance() should return 0 when zero-width ranges are compared. >> >> distance(IRanges(2,1), IRanges(2,1)) ## should return 0 >> >> (2) distance() between a zero-width and non-zero-width range will be computed as always, i.e., Herve's examples hold where all distances are equal to 8, >> >> distance(IRanges(1,2), IRanges(10,16)) >> distance(IRanges(10,16), IRanges(1,2)) >> distance(IRanges(1,2), IRanges(10,9)) >> distance(IRanges(10,9),IRanges(1,2)) >> distance(IRanges(3,2), IRanges(10,16)) >> distance(IRanges(10,16), IRanges(3,2)) >> distance(IRanges(3,2), IRanges(10,9)) >> distance(IRanges(10,9), IRanges(3,2)) >> >> I will make the necessary changes to distance(). Thanks for the input. >> >> Valerie >> >> On 10/17/2012 10:41 AM, Hervé Pagès wrote: >>> Hi, >>> >>> On 10/17/2012 09:22 AM, Harris A. Jaffee wrote: >>>> On Oct 17, 2012, at 11:31 AM, Cook, Malcolm wrote: >>>> >>>>> Indeed, for distance to behave as a metric it must be that distance(x,x) >>>>> == 0 for all x >>>>> >>>>> Here is some further (consistent) oddness: >>>>> >>>>>> compare(GRanges('x',IRanges(4,3)),GRanges('x',IRanges(4,3))) >>>>> [1] -5 >>>> >>>> Thanks for the tip on compare. Looks like a nice classification scheme. >>>> I should start reading vignettes! >>>> >>>>> I would expect the value to be 0 here. Comparing x with itself should >>>>> yield 0. After all they are identical! >>>> >>>> Yeah, identical but *empty*. Therefore, none of the scenarios depicted >>>> in the man page seems appropriate (even 0/g): >>>> >>>> -6 a: x[i]: .oooo....... 6 m: x[i]: .......oooo. >>>> y[i]: .......oooo. y[i]: .oooo....... >>>> >>>> -5 b: x[i]: ..oooo...... 5 l: x[i]: ......oooo.. >>>> y[i]: ......oooo.. y[i]: ..oooo...... >>>> >>>> -4 c: x[i]: ...oooo..... 4 k: x[i]: .....oooo... >>>> y[i]: .....oooo... y[i]: ...oooo..... >>>> >>>> -3 d: x[i]: ...oooooo... 3 j: x[i]: .....oooo... >>>> y[i]: .....oooo... y[i]: ...oooooo... >>>> >>>> -2 e: x[i]: ..oooooooo.. 2 i: x[i]: ....oooo.... >>>> y[i]: ....oooo.... y[i]: ..oooooooo.. >>>> >>>> -1 f: x[i]: ...oooo..... 1 h: x[i]: ...oooooo... >>>> y[i]: ...oooooo... y[i]: ...oooo..... >>>> >>>> 0 g: x[i]: ...oooooo... >>>> y[i]: ...oooooo... >>>> >>>> >>>> Again, I like NA in these cases. >>> >>> compare() should be consistent with the comparison binary operators >>> (==, !=, <=, <, >=, >), in the sense that we should have: >>> >>> x == y <=> compare(x, y) == 0 >>> x != y <=> compare(x, y) != 0 >>> x <= y <=> compare(x, y) <= 0 >>> etc... >>> >>> Right now those operations are actually using compare() behind the >>> scene when the operands are GRanges objects: >>> >>> > gr <- GRanges('x',IRanges(4,3)) >>> > gr == gr >>> [1] FALSE >>> >>> So compare(x, x) needs to be fixed to always return 0. >>> >>> Note that the comparison binary operators on Ranges objects are >>> not based on compare() yet so even though compare(ir, ir) is not >>> 0 here, we get a TRUE for ir == ir: >>> >>> > compare(ir, ir) >>> [1] -5 >>> > ir == ir >>> [1] TRUE >>> >>> I have on my list to change the current implementation of the >>> comparison binary operators on Ranges objects so they also use >>> compare() behind the scene (that will make them about twice >>> faster because compare() is implemented in C). >>> >>> I'll fix compare() right away. Thanks for the feedback! >>> >>> H. >>> >>> >>>> >>>> >>>>> --Malcolm >>>>> >>>>> >>>>> On 10/17/12 9:50 AM, "Tim Triche, Jr." <tim.triche at="" gmail.com=""> wrote: >>>>> >>>>>> Hmm, that's odd. And it seems inconsistent with the usual distance() and >>>>>> distanceToNearest() methods, which would return 0 in this case, yes? >>>>>> >>>>>> e.g. if the IRanges were GRanges and sitting on top of each other on the >>>>>> same strand, or if the first was distance()'ed against itself (the same >>>>>> thing) >>>>>> >>>>>> That doesn't make sense to me either. This doesn't seem like an issue of >>>>>> theology to me, but rather inconsistency with what the docs say will >>>>>> happen. >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Oct 17, 2012 at 7:31 AM, Harris A. Jaffee <hj at="" jhu.edu=""> wrote: >>>>>> >>>>>>> As I said, any rule is fine, especially if they can serve a purpose >>>>>>> via their "location". But you also have this, which seems weird: >>>>>>> >>>>>>>> distance(IRanges(2,1), IRanges(2,1)) >>>>>>> [1] 1 >>>>>>> >>>>>>> To say they are not treated any differently is a big stretch to me, >>>>>>> since their defining formula (end = start-1) is already a violation >>>>>>> of my senses. [I wonder if SEW = (start, NA, 0) wouldn't have been >>>>>>> more useful.] >>>>>>> >>>>>>> Just so we don't continue off into useless theology, I should say that >>>>>>> I started looking into the distance code with the idea of adding a >>>>>>> "signed" option (say, distance(x,y)<0 iff x is downstream from y), and >>>>>>> I wanted to understand the landscape first. >>>>>>> >>>>>>> On Oct 17, 2012, at 12:56 AM, Hervé Pagès wrote: >>>>>>> >>>>>>>> Hi Harris, Val, Michael, >>>>>>>> >>>>>>>> Right now, all those distances are equal to 8, which I think is what >>>>>>>> we want: >>>>>>>> >>>>>>>> distance(IRanges(1,2), IRanges(10,16)) >>>>>>>> distance(IRanges(10,16), IRanges(1,2)) >>>>>>>> distance(IRanges(1,2), IRanges(10,9)) >>>>>>>> distance(IRanges(10,9),IRanges(1,2)) >>>>>>>> distance(IRanges(3,2), IRanges(10,16)) >>>>>>>> distance(IRanges(10,16), IRanges(3,2)) >>>>>>>> distance(IRanges(3,2), IRanges(10,9)) >>>>>>>> distance(IRanges(10,9), IRanges(3,2)) >>>>>>>> >>>>>>>> Yes, from a mathematical point of view, the distance between a >>>>>>>> non-empty set and an empty set if not defined, but I'm not sure there >>>>>>>> would be much to gain in doing this. Some use cases would be needed. >>>>>>>> What's nice about the current behavior is that zero-width ranges are >>>>>>>> not treated any differently than other ranges. >>>>>>>> >>>>>>>> H. >>>>>>>> >>>>>>>> >>>>>>>> On 10/16/2012 12:21 PM, Harris A. Jaffee wrote: >>>>>>>>> Any rule is fine with me. They would not occur for us except by >>>>>>> mistake. >>>>>>>>> I was basically raising an alert that the current situation is not >>>>>>> quite >>>>>>>>> complete. >>>>>>>>> >>>>>>>>> On Oct 16, 2012, at 2:53 PM, Michael Lawrence wrote: >>>>>>>>> >>>>>>>>>> Since these zero-width ranges do have a position, I'm not sure why >>>>>>> we >>>>>>> cannot calculate their distance. As long as we have an established rule >>>>>>> about how we handle them. For example, is it just from the start? That's >>>>>>> probably most intuitive. >>>>>>>>>> >>>>>>>>>> Michael >>>>>>>>>> >>>>>>>>>> On Tue, Oct 16, 2012 at 11:25 AM, Valerie Obenchain < >>>>>>> vobencha at fhcrc.org> wrote: >>>>>>>>>> On 10/16/2012 09:25 AM, Harris A. Jaffee wrote: >>>>>>>>>> Fantastic, thank you. I'm not sure I knew about that page! >>>>>>>>>> >>>>>>>>>> For example, ?precede gives me the IRanges nearest-methods page, >>>>>>>>>> and only if I explicitly request ?'nearest-methods' can I choose >>>>>>>>>> the GenomicRanges page. That brings up another issue: ?distance >>>>>>>>>> again sends me to the IRanges nearest-methods help, and that only >>>>>>>>>> documents distanceToNearest, which is not enough. The spec for >>>>>>>>>> distance actually lives only on >>>>>>>>>> ?'precede,GenomicRanges,GenomicRanges-method', as far as I see. >>>>>>>>>> >>>>>>>>>> I can update the IRanges man page to include distance(). >>>>>>>>>> >>>>>>>>>> Finally, and admittedly a total edge case, but zero-width ranges >>>>>>>>>> appear to be mishandled, or not handled, by the IRanges distance >>>>>>>>>> code. I think they should be infinitely far from anything, or at >>>>>>>>>> least NA, and there should not be any nearest or distanceToNearest >>>>>>>>>> value for a zero-width query. >>>>>>>>>> cc'ing Herve and Michael for history. >>>>>>>>>> >>>>>>>>>> In general I agree that zero-width ranges should not have a distance >>>>>>> from any other range. I favor returning NA instead of inf since the >>>>>>> range >>>>>>> (albeit zero-width) does have a location. >>>>>>>>>> >>>>>>>>>> One concern I have is that we use zero-width ranges to represent >>>>>>> insertions. In that context we may want to compute the distance to the >>>>>>> nearest insertion. I do not have a concrete use case - just thinking of >>>>>>> what might come. I'd be interested in opinions on this. >>>>>>>>>> >>>>>>>>>> Valerie >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> x >>>>>>>>>> IRanges of length 1 >>>>>>>>>> start end width >>>>>>>>>> [1] 2 1 0 >>>>>>>>>> >>>>>>>>>> distance(x, x) >>>>>>>>>> [1] 1 >>>>>>>>>> >>>>>>>>>> y >>>>>>>>>> IRanges of length 1 >>>>>>>>>> start end width >>>>>>>>>> [1] 3 2 0 >>>>>>>>>> >>>>>>>>>> distance(x, y) >>>>>>>>>> [1] 2 >>>>>>>>>> >>>>>>>>>> z >>>>>>>>>> IRanges of length 1 >>>>>>>>>> start end width >>>>>>>>>> [1] 3 3 1 >>>>>>>>>> >>>>>>>>>> distance(x, z) >>>>>>>>>> [1] 2 >>>>>>>>>> >>>>>>>>>> On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote: >>>>>>>>>> >>>>>>>>>> A few follow up items from an off-list conversation with Harris. >>>>>>>>>> >>>>>>>>>> 1) I reversed the use of 5' and3' in my response below. Sorry about >>>>>>> that. >>>>>>>>>> >>>>>>>>>> 2) The primary landing page for ?precede is in IRanges. I've >>>>>>> expanded >>>>>>> the \seealso section of this page to point the user to the man page in >>>>>>> GenomicRanges. >>>>>>>>>> >>>>>>>>>> 3) The precede/follow man page in GenomicRanges now specifically >>>>>>> states that 5' to 3' is the relevant orientation for precede/follow. For >>>>>>> those interested in taking a look, the man page has several aliases, >>>>>>>>>> >>>>>>>>>> ?'nearest-methods' >>>>>>>>>> ?'precede,GenomicRanges,GenomicRanges-method' >>>>>>>>>> ?'follow,GenomicRanges,GenomicRanges-method' >>>>>>>>>> >>>>>>>>>> Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4. >>>>>>>>>> >>>>>>>>>> Thanks Harris. >>>>>>>>>> >>>>>>>>>> Valerie >>>>>>>>>> >>>>>>>>>> On 10/12/2012 04:17 PM, Valerie Obenchain wrote: >>>>>>>>>> Hi Harris, >>>>>>>>>> >>>>>>>>>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: >>>>>>>>>> Apologies in advance if I just don't understand the ignore.strand >>>>>>> switch, or >>>>>>>>>> perhaps GRanges objects. Also, I have not even tried to understand >>>>>>>>>> >>>>>>>>>> GenomicRanges:::.GenomicRanges_findPrecedeFollow >>>>>>>>>> >>>>>>>>>> I have the impression that a query with strand entirely "*" >>>>>>> essentially implies >>>>>>>>>> ignore.strand, but this case here is handled correctly only if >>>>>>> ignore.strand is >>>>>>>>>> TRUE (consistent with distanceToNearest but not distance, which >>>>>>> seems >>>>>>> correct): >>>>>>>>>> >>>>>>>>>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", >>>>>>> strand="*") >>>>>>>>>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), >>>>>>> seqnames="chr1", >>>>>>> strand=c("+","-")) >>>>>>>>>> distance(x, Y[1]) >>>>>>>>>> [1] 1 >>>>>>>>>> distance(x, Y[2]) >>>>>>>>>> [1] 2 >>>>>>>>>> >>>>>>>>>> nearest(x, Y, ignore.strand=TRUE) # correct >>>>>>>>>> [1] 1 >>>>>>>>>> nearest(ranges(x), ranges(Y)) # also correct >>>>>>>>>> [1] 1 >>>>>>>>>> >>>>>>>>>> However, >>>>>>>>>> >>>>>>>>>> nearest(x, Y) >>>>>>>>>> [1] 2 >>>>>>>>>> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and >>>>>>> 1.10.2. >>>>>>>>>> >>>>>>>>>> The problem was in .nearest() where I computed the preceding and >>>>>>> following distance of the ranges. I was missing abs() so the distance >>>>>>> for >>>>>>> the second range was -2 instead of 2. When checking to see which range >>>>>>> was >>>>>>> closest >>>>>>>>>> >>>>>>>>>> 1< -2 >>>>>>>>>> [1] FALSE >>>>>>>>>> >>>>>>>>>> so the second range won. >>>>>>>>>> distanceToNearest(x, Y) >>>>>>>>>> DataFrame with 1 row and 3 columns >>>>>>>>>> queryHits subjectHits distance >>>>>>>>>> <integer> <integer> <integer> >>>>>>>>>> 1 1 2 2 >>>>>>>>>> >>>>>>>>>> distanceToNearest(x, Y, ignore.strand=TRUE) >>>>>>>>>> DataFrame with 1 row and 3 columns >>>>>>>>>> queryHits subjectHits distance >>>>>>>>>> <integer> <integer> <integer> >>>>>>>>>> 1 1 1 1 >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Finally, >>>>>>>>>> >>>>>>>>>> follow(ranges(x), ranges(Y)) >>>>>>>>>> [1] NA >>>>>>>>>> >>>>>>>>>> The issue (along with GenomicRanges:::.nearest) must come down to >>>>>>> this: >>>>>>>>>> >>>>>>>>>> follow(x, Y) # how can this be? >>>>>>>>>> [1] 2 >>>>>>>>>> This behavior is due to the fact that a '*' range can compare itself >>>>>>> to either '+' or '-'. The 'x' is '*" located at position 5. When >>>>>>> comparing it to the '+' Y[1] >>>>>>>>>> we think of both ranges as '+'. precede() and follow() locations are >>>>>>> determined by moving from 3' to 5'. On '+' strand this is from left to >>>>>>> right so 5 precedes 6. >>>>>>>>>> >>>>>>>>>> precede(x, Y[1]) >>>>>>>>>> [1] 1 >>>>>>>>>> >>>>>>>>>> but does not follow it >>>>>>>>>> >>>>>>>>>> follow(x, Y[1]) >>>>>>>>>> [1] NA >>>>>>>>>> >>>>>>>>>> When comparing the '*' to Y[2] we think of both ranges as '-'. On >>>>>>> the >>>>>>> '-' moving from 3' to 5' is right to left so the 5 follows 7 >>>>>>>>>> >>>>>>>>>> follow(x, Y[2]) >>>>>>>>>> [1] 1 >>>>>>>>>> >>>>>>>>>> but does not precede it >>>>>>>>>> >>>>>>>>>> precede(x, Y[2]) >>>>>>>>>> [1] NA >>>>>>>>>> >>>>>>>>>> When we set ignore.strand=TRUE, all ranges are considered '+'. >>>>>>>>>> >>>>>>>>>> Valerie >>>>>>>>>> whereas >>>>>>>>>> >>>>>>>>>> follow(x, Y, ignore.strand=TRUE) # correct, I think >>>>>>>>>> [1] NA >>>>>>>>>> >>>>>>>>>> sessionInfo() >>>>>>>>>> R Under development (unstable) (2012-10-10 r60908) >>>>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>>>> >>>>>>>>>> locale: >>>>>>>>>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C >>>>>>>>>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 >>>>>>>>>> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 >>>>>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C >>>>>>>>>> >>>>>>>>>> attached base packages: >>>>>>>>>> [1] stats graphics grDevices datasets utils methods base >>>>>>>>>> >>>>>>>>>> other attached packages: >>>>>>>>>> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 >>>>>>>>>> >>>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>>> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Bioconductor mailing list >>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>>> _______________________________________________ >>>>>>>>>> Bioconductor mailing list >>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Hervé Pagès >>>>>>>> >>>>>>>> Program in Computational Biology >>>>>>>> Division of Public Health Sciences >>>>>>>> Fred Hutchinson Cancer Research Center >>>>>>>> 1100 Fairview Ave. N, M1-B514 >>>>>>>> P.O. Box 19024 >>>>>>>> Seattle, WA 98109-1024 >>>>>>>> >>>>>>>> E-mail: hpages at fhcrc.org >>>>>>>> Phone: (206) 667-5791 >>>>>>>> Fax: (206) 667-1319 >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> *A model is a lie that helps you see the truth.* >>>>>> * >>>>>> * >>>>>> Howard >>>>>> Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >>>>>> >>>>>> [[alternative HTML version deleted]] >>>>>> >>>>> >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY
0
Entering edit mode
Hi Harris, On 10/24/2012 09:25 AM, Harris A. Jaffee wrote: > We have probably overloaded the subject of this thread! > > I have to say I don't get it. Moreover, I believe it conflicts with the > definition of IRanges, which (aside from empty ones, which for all I care can > be relegated to a different class) has always been, to me, any one of: > > 1) the closed interval [start, end] > 2) the vector start:end, as a set > 3) the two numbers 'start' and 'end' > > distance(x, y) always meant the minimum absolute difference between the sets > for x and y. This was/is intuitive and useful. > > If I understand the new picture, you are equating an IRanges (non- empty ones > anyway) with the closed interval > > [start - .5, end + .5] > > and then applying to that the same minimum absolute difference spec. Why? > > Maybe you have made 'width' (start-end+1) more natural as the size of your > new interval, or the number of blocks, but it was already the length of the > vector start:end. > > If this change was because of something I said, I must have said it badly. Just to recap - The change was motivated by the discovery that distance between a zero-width range and itself was 1. IRanges 1.16.3 in release: > distance(IRanges(4,3), IRanges(4,3)) [1] 1 If we made self-hits a distance of 0 (consistent with non-zero width ranges) we would then have a situation where distance dropped 2 units after only moving 1 unit. > sapply(-2:2, function(i) distance(shift(IRanges(4,3), i), IRanges(4,3))) [1] 3 2 0 2 3 I looked into changing the distance() code and found that many 'special case' (if/else) situations would be needed to handle the zero-width ranges. Zero-width may be the edge case for some but could be mainstream for others. Their role in representing insertions is an important one and we'd like their treatment to be as robust as possible. That said, we wanted a consistent set of rules that would apply to all ranges, zero- or non-zero width. We decided on a 'block' model. For those who haven't seen the man page, a visual representation of IRanges(3,6) would be +-----+-----+-----+-----+ "block" model 3 4 5 6 instead of: +-----+-----+-----+ "isolated dots" model 3 4 5 6 This model was intutitive from our point of view in the sense that there is a unit ("block") associated with the distance. The distance between ranges is X blocks. This also allow us to treat all ranges equally. The new distance() has a smooth transition down to a self-hit of 0 for zero-width ranges. IRanges 1.17.7 in devel: > sapply(-2:2, function(i) distance(shift(IRanges(4,3), i), IRanges(4,3))) [1] 2 1 0 1 2 Warning messages: 1: In .local(x, y, ...) : The behavior of distance() has changed in Bioconductor 2.12. See ?distance for details. ... This change to distance will be most noticeable for users with code that depended on the distance between adjacent ranges being 1. Now adjacent ranges have a distance of 0. Aside from that, distance has changed by 1 which we thought should not have much affect on existing code. Valerie > > On Oct 22, 2012, at 7:40 PM, Valerie Obenchain wrote: > >> The behavior of distance() has been changed in IRanges 1.7.17 and propagated to GenomicRanges 1.11.5. >> >> To handle zero-width ranges in a consistent and (hopefully) intuitive way we've gone with a 'block' model representation. A range is made up of blocks of size 1. The blocks are adjacent to each other and there is no gap between them. With this model the distance between 2 consecutive blocks is 0L (previously it was 1L). The distance function now returns the number of gaps between the blocks. This is all explained on the ?distance man page. If it is not clear please let me know. >> >> Highlights : >> - distance() now returns 1 less than what it did in<= IRanges 1.7.16. >> - Adjacent or overlapping ranges return a distance of 0L. >> - A temporary warning is emitted from distance,Ranges,Ranges which will be removed when Bioconductor 2.13 is released. >> - distance(x, y) now allows 'x' and 'y' to have different lengths (shortest recycled to length of longest). >> >> Valerie >> >> On 10/17/2012 04:02 PM, Valerie Obenchain wrote: >>> Thanks Herve for taking care of compare(). >>> >>> Harris, you mentioned a 'signed' option for distance(). Tim is working on a patch for distanceToNearest() for GRanges that will return signed distances. The better place to make this change would be at the distance() level. The two of you may want to join forces on this? >>> >>> I think we now agree on the following, >>> >>> (1) distance() should return 0 when zero-width ranges are compared. >>> >>> distance(IRanges(2,1), IRanges(2,1)) ## should return 0 >>> >>> (2) distance() between a zero-width and non-zero-width range will be computed as always, i.e., Herve's examples hold where all distances are equal to 8, >>> >>> distance(IRanges(1,2), IRanges(10,16)) >>> distance(IRanges(10,16), IRanges(1,2)) >>> distance(IRanges(1,2), IRanges(10,9)) >>> distance(IRanges(10,9),IRanges(1,2)) >>> distance(IRanges(3,2), IRanges(10,16)) >>> distance(IRanges(10,16), IRanges(3,2)) >>> distance(IRanges(3,2), IRanges(10,9)) >>> distance(IRanges(10,9), IRanges(3,2)) >>> >>> I will make the necessary changes to distance(). Thanks for the input. >>> >>> Valerie >>> >>> On 10/17/2012 10:41 AM, Hervé Pagès wrote: >>>> Hi, >>>> >>>> On 10/17/2012 09:22 AM, Harris A. Jaffee wrote: >>>>> On Oct 17, 2012, at 11:31 AM, Cook, Malcolm wrote: >>>>> >>>>>> Indeed, for distance to behave as a metric it must be that distance(x,x) >>>>>> == 0 for all x >>>>>> >>>>>> Here is some further (consistent) oddness: >>>>>> >>>>>>> compare(GRanges('x',IRanges(4,3)),GRanges('x',IRanges(4,3))) >>>>>> [1] -5 >>>>> >>>>> Thanks for the tip on compare. Looks like a nice classification scheme. >>>>> I should start reading vignettes! >>>>> >>>>>> I would expect the value to be 0 here. Comparing x with itself should >>>>>> yield 0. After all they are identical! >>>>> >>>>> Yeah, identical but *empty*. Therefore, none of the scenarios depicted >>>>> in the man page seems appropriate (even 0/g): >>>>> >>>>> -6 a: x[i]: .oooo....... 6 m: x[i]: .......oooo. >>>>> y[i]: .......oooo. y[i]: .oooo....... >>>>> >>>>> -5 b: x[i]: ..oooo...... 5 l: x[i]: ......oooo.. >>>>> y[i]: ......oooo.. y[i]: ..oooo...... >>>>> >>>>> -4 c: x[i]: ...oooo..... 4 k: x[i]: .....oooo... >>>>> y[i]: .....oooo... y[i]: ...oooo..... >>>>> >>>>> -3 d: x[i]: ...oooooo... 3 j: x[i]: .....oooo... >>>>> y[i]: .....oooo... y[i]: ...oooooo... >>>>> >>>>> -2 e: x[i]: ..oooooooo.. 2 i: x[i]: ....oooo.... >>>>> y[i]: ....oooo.... y[i]: ..oooooooo.. >>>>> >>>>> -1 f: x[i]: ...oooo..... 1 h: x[i]: ...oooooo... >>>>> y[i]: ...oooooo... y[i]: ...oooo..... >>>>> >>>>> 0 g: x[i]: ...oooooo... >>>>> y[i]: ...oooooo... >>>>> >>>>> >>>>> Again, I like NA in these cases. >>>> >>>> compare() should be consistent with the comparison binary operators >>>> (==, !=,<=,<,>=,>), in the sense that we should have: >>>> >>>> x == y<=> compare(x, y) == 0 >>>> x != y<=> compare(x, y) != 0 >>>> x<= y<=> compare(x, y)<= 0 >>>> etc... >>>> >>>> Right now those operations are actually using compare() behind the >>>> scene when the operands are GRanges objects: >>>> >>>>> gr<- GRanges('x',IRanges(4,3)) >>>>> gr == gr >>>> [1] FALSE >>>> >>>> So compare(x, x) needs to be fixed to always return 0. >>>> >>>> Note that the comparison binary operators on Ranges objects are >>>> not based on compare() yet so even though compare(ir, ir) is not >>>> 0 here, we get a TRUE for ir == ir: >>>> >>>>> compare(ir, ir) >>>> [1] -5 >>>>> ir == ir >>>> [1] TRUE >>>> >>>> I have on my list to change the current implementation of the >>>> comparison binary operators on Ranges objects so they also use >>>> compare() behind the scene (that will make them about twice >>>> faster because compare() is implemented in C). >>>> >>>> I'll fix compare() right away. Thanks for the feedback! >>>> >>>> H. >>>> >>>> >>>>> >>>>> >>>>>> --Malcolm >>>>>> >>>>>> >>>>>> On 10/17/12 9:50 AM, "Tim Triche, Jr."<tim.triche at="" gmail.com=""> wrote: >>>>>> >>>>>>> Hmm, that's odd. And it seems inconsistent with the usual distance() and >>>>>>> distanceToNearest() methods, which would return 0 in this case, yes? >>>>>>> >>>>>>> e.g. if the IRanges were GRanges and sitting on top of each other on the >>>>>>> same strand, or if the first was distance()'ed against itself (the same >>>>>>> thing) >>>>>>> >>>>>>> That doesn't make sense to me either. This doesn't seem like an issue of >>>>>>> theology to me, but rather inconsistency with what the docs say will >>>>>>> happen. >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Oct 17, 2012 at 7:31 AM, Harris A. Jaffee<hj at="" jhu.edu=""> wrote: >>>>>>> >>>>>>>> As I said, any rule is fine, especially if they can serve a purpose >>>>>>>> via their "location". But you also have this, which seems weird: >>>>>>>> >>>>>>>>> distance(IRanges(2,1), IRanges(2,1)) >>>>>>>> [1] 1 >>>>>>>> >>>>>>>> To say they are not treated any differently is a big stretch to me, >>>>>>>> since their defining formula (end = start-1) is already a violation >>>>>>>> of my senses. [I wonder if SEW = (start, NA, 0) wouldn't have been >>>>>>>> more useful.] >>>>>>>> >>>>>>>> Just so we don't continue off into useless theology, I should say that >>>>>>>> I started looking into the distance code with the idea of adding a >>>>>>>> "signed" option (say, distance(x,y)<0 iff x is downstream from y), and >>>>>>>> I wanted to understand the landscape first. >>>>>>>> >>>>>>>> On Oct 17, 2012, at 12:56 AM, Hervé Pagès wrote: >>>>>>>> >>>>>>>>> Hi Harris, Val, Michael, >>>>>>>>> >>>>>>>>> Right now, all those distances are equal to 8, which I think is what >>>>>>>>> we want: >>>>>>>>> >>>>>>>>> distance(IRanges(1,2), IRanges(10,16)) >>>>>>>>> distance(IRanges(10,16), IRanges(1,2)) >>>>>>>>> distance(IRanges(1,2), IRanges(10,9)) >>>>>>>>> distance(IRanges(10,9),IRanges(1,2)) >>>>>>>>> distance(IRanges(3,2), IRanges(10,16)) >>>>>>>>> distance(IRanges(10,16), IRanges(3,2)) >>>>>>>>> distance(IRanges(3,2), IRanges(10,9)) >>>>>>>>> distance(IRanges(10,9), IRanges(3,2)) >>>>>>>>> >>>>>>>>> Yes, from a mathematical point of view, the distance between a >>>>>>>>> non-empty set and an empty set if not defined, but I'm not sure there >>>>>>>>> would be much to gain in doing this. Some use cases would be needed. >>>>>>>>> What's nice about the current behavior is that zero-width ranges are >>>>>>>>> not treated any differently than other ranges. >>>>>>>>> >>>>>>>>> H. >>>>>>>>> >>>>>>>>> >>>>>>>>> On 10/16/2012 12:21 PM, Harris A. Jaffee wrote: >>>>>>>>>> Any rule is fine with me. They would not occur for us except by >>>>>>>> mistake. >>>>>>>>>> I was basically raising an alert that the current situation is not >>>>>>>> quite >>>>>>>>>> complete. >>>>>>>>>> >>>>>>>>>> On Oct 16, 2012, at 2:53 PM, Michael Lawrence wrote: >>>>>>>>>> >>>>>>>>>>> Since these zero-width ranges do have a position, I'm not sure why >>>>>>>> we >>>>>>>> cannot calculate their distance. As long as we have an established rule >>>>>>>> about how we handle them. For example, is it just from the start? That's >>>>>>>> probably most intuitive. >>>>>>>>>>> >>>>>>>>>>> Michael >>>>>>>>>>> >>>>>>>>>>> On Tue, Oct 16, 2012 at 11:25 AM, Valerie Obenchain< >>>>>>>> vobencha at fhcrc.org> wrote: >>>>>>>>>>> On 10/16/2012 09:25 AM, Harris A. Jaffee wrote: >>>>>>>>>>> Fantastic, thank you. I'm not sure I knew about that page! >>>>>>>>>>> >>>>>>>>>>> For example, ?precede gives me the IRanges nearest-methods page, >>>>>>>>>>> and only if I explicitly request ?'nearest-methods' can I choose >>>>>>>>>>> the GenomicRanges page. That brings up another issue: ?distance >>>>>>>>>>> again sends me to the IRanges nearest-methods help, and that only >>>>>>>>>>> documents distanceToNearest, which is not enough. The spec for >>>>>>>>>>> distance actually lives only on >>>>>>>>>>> ?'precede,GenomicRanges,GenomicRanges-method', as far as I see. >>>>>>>>>>> >>>>>>>>>>> I can update the IRanges man page to include distance(). >>>>>>>>>>> >>>>>>>>>>> Finally, and admittedly a total edge case, but zero-width ranges >>>>>>>>>>> appear to be mishandled, or not handled, by the IRanges distance >>>>>>>>>>> code. I think they should be infinitely far from anything, or at >>>>>>>>>>> least NA, and there should not be any nearest or distanceToNearest >>>>>>>>>>> value for a zero-width query. >>>>>>>>>>> cc'ing Herve and Michael for history. >>>>>>>>>>> >>>>>>>>>>> In general I agree that zero-width ranges should not have a distance >>>>>>>> from any other range. I favor returning NA instead of inf since the >>>>>>>> range >>>>>>>> (albeit zero-width) does have a location. >>>>>>>>>>> >>>>>>>>>>> One concern I have is that we use zero-width ranges to represent >>>>>>>> insertions. In that context we may want to compute the distance to the >>>>>>>> nearest insertion. I do not have a concrete use case - just thinking of >>>>>>>> what might come. I'd be interested in opinions on this. >>>>>>>>>>> >>>>>>>>>>> Valerie >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> x >>>>>>>>>>> IRanges of length 1 >>>>>>>>>>> start end width >>>>>>>>>>> [1] 2 1 0 >>>>>>>>>>> >>>>>>>>>>> distance(x, x) >>>>>>>>>>> [1] 1 >>>>>>>>>>> >>>>>>>>>>> y >>>>>>>>>>> IRanges of length 1 >>>>>>>>>>> start end width >>>>>>>>>>> [1] 3 2 0 >>>>>>>>>>> >>>>>>>>>>> distance(x, y) >>>>>>>>>>> [1] 2 >>>>>>>>>>> >>>>>>>>>>> z >>>>>>>>>>> IRanges of length 1 >>>>>>>>>>> start end width >>>>>>>>>>> [1] 3 3 1 >>>>>>>>>>> >>>>>>>>>>> distance(x, z) >>>>>>>>>>> [1] 2 >>>>>>>>>>> >>>>>>>>>>> On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote: >>>>>>>>>>> >>>>>>>>>>> A few follow up items from an off-list conversation with Harris. >>>>>>>>>>> >>>>>>>>>>> 1) I reversed the use of 5' and3' in my response below. Sorry about >>>>>>>> that. >>>>>>>>>>> >>>>>>>>>>> 2) The primary landing page for ?precede is in IRanges. I've >>>>>>>> expanded >>>>>>>> the \seealso section of this page to point the user to the man page in >>>>>>>> GenomicRanges. >>>>>>>>>>> >>>>>>>>>>> 3) The precede/follow man page in GenomicRanges now specifically >>>>>>>> states that 5' to 3' is the relevant orientation for precede/follow. For >>>>>>>> those interested in taking a look, the man page has several aliases, >>>>>>>>>>> >>>>>>>>>>> ?'nearest-methods' >>>>>>>>>>> ?'precede,GenomicRanges,GenomicRanges-method' >>>>>>>>>>> ?'follow,GenomicRanges,GenomicRanges-method' >>>>>>>>>>> >>>>>>>>>>> Changes can be found in IRanges 1.17.3 and GenomicRanges 1.11.4. >>>>>>>>>>> >>>>>>>>>>> Thanks Harris. >>>>>>>>>>> >>>>>>>>>>> Valerie >>>>>>>>>>> >>>>>>>>>>> On 10/12/2012 04:17 PM, Valerie Obenchain wrote: >>>>>>>>>>> Hi Harris, >>>>>>>>>>> >>>>>>>>>>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote: >>>>>>>>>>> Apologies in advance if I just don't understand the ignore.strand >>>>>>>> switch, or >>>>>>>>>>> perhaps GRanges objects. Also, I have not even tried to understand >>>>>>>>>>> >>>>>>>>>>> GenomicRanges:::.GenomicRanges_findPrecedeFollow >>>>>>>>>>> >>>>>>>>>>> I have the impression that a query with strand entirely "*" >>>>>>>> essentially implies >>>>>>>>>>> ignore.strand, but this case here is handled correctly only if >>>>>>>> ignore.strand is >>>>>>>>>>> TRUE (consistent with distanceToNearest but not distance, which >>>>>>>> seems >>>>>>>> correct): >>>>>>>>>>> >>>>>>>>>>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", >>>>>>>> strand="*") >>>>>>>>>>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), >>>>>>>> seqnames="chr1", >>>>>>>> strand=c("+","-")) >>>>>>>>>>> distance(x, Y[1]) >>>>>>>>>>> [1] 1 >>>>>>>>>>> distance(x, Y[2]) >>>>>>>>>>> [1] 2 >>>>>>>>>>> >>>>>>>>>>> nearest(x, Y, ignore.strand=TRUE) # correct >>>>>>>>>>> [1] 1 >>>>>>>>>>> nearest(ranges(x), ranges(Y)) # also correct >>>>>>>>>>> [1] 1 >>>>>>>>>>> >>>>>>>>>>> However, >>>>>>>>>>> >>>>>>>>>>> nearest(x, Y) >>>>>>>>>>> [1] 2 >>>>>>>>>>> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and >>>>>>>> 1.10.2. >>>>>>>>>>> >>>>>>>>>>> The problem was in .nearest() where I computed the preceding and >>>>>>>> following distance of the ranges. I was missing abs() so the distance >>>>>>>> for >>>>>>>> the second range was -2 instead of 2. When checking to see which range >>>>>>>> was >>>>>>>> closest >>>>>>>>>>> >>>>>>>>>>> 1< -2 >>>>>>>>>>> [1] FALSE >>>>>>>>>>> >>>>>>>>>>> so the second range won. >>>>>>>>>>> distanceToNearest(x, Y) >>>>>>>>>>> DataFrame with 1 row and 3 columns >>>>>>>>>>> queryHits subjectHits distance >>>>>>>>>>> <integer> <integer> <integer> >>>>>>>>>>> 1 1 2 2 >>>>>>>>>>> >>>>>>>>>>> distanceToNearest(x, Y, ignore.strand=TRUE) >>>>>>>>>>> DataFrame with 1 row and 3 columns >>>>>>>>>>> queryHits subjectHits distance >>>>>>>>>>> <integer> <integer> <integer> >>>>>>>>>>> 1 1 1 1 >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Finally, >>>>>>>>>>> >>>>>>>>>>> follow(ranges(x), ranges(Y)) >>>>>>>>>>> [1] NA >>>>>>>>>>> >>>>>>>>>>> The issue (along with GenomicRanges:::.nearest) must come down to >>>>>>>> this: >>>>>>>>>>> >>>>>>>>>>> follow(x, Y) # how can this be? >>>>>>>>>>> [1] 2 >>>>>>>>>>> This behavior is due to the fact that a '*' range can compare itself >>>>>>>> to either '+' or '-'. The 'x' is '*" located at position 5. When >>>>>>>> comparing it to the '+' Y[1] >>>>>>>>>>> we think of both ranges as '+'. precede() and follow() locations are >>>>>>>> determined by moving from 3' to 5'. On '+' strand this is from left to >>>>>>>> right so 5 precedes 6. >>>>>>>>>>> >>>>>>>>>>> precede(x, Y[1]) >>>>>>>>>>> [1] 1 >>>>>>>>>>> >>>>>>>>>>> but does not follow it >>>>>>>>>>> >>>>>>>>>>> follow(x, Y[1]) >>>>>>>>>>> [1] NA >>>>>>>>>>> >>>>>>>>>>> When comparing the '*' to Y[2] we think of both ranges as '-'. On >>>>>>>> the >>>>>>>> '-' moving from 3' to 5' is right to left so the 5 follows 7 >>>>>>>>>>> >>>>>>>>>>> follow(x, Y[2]) >>>>>>>>>>> [1] 1 >>>>>>>>>>> >>>>>>>>>>> but does not precede it >>>>>>>>>>> >>>>>>>>>>> precede(x, Y[2]) >>>>>>>>>>> [1] NA >>>>>>>>>>> >>>>>>>>>>> When we set ignore.strand=TRUE, all ranges are considered '+'. >>>>>>>>>>> >>>>>>>>>>> Valerie >>>>>>>>>>> whereas >>>>>>>>>>> >>>>>>>>>>> follow(x, Y, ignore.strand=TRUE) # correct, I think >>>>>>>>>>> [1] NA >>>>>>>>>>> >>>>>>>>>>> sessionInfo() >>>>>>>>>>> R Under development (unstable) (2012-10-10 r60908) >>>>>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>>>>> >>>>>>>>>>> locale: >>>>>>>>>>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C >>>>>>>>>>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 >>>>>>>>>>> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 >>>>>>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>>>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C >>>>>>>>>>> >>>>>>>>>>> attached base packages: >>>>>>>>>>> [1] stats graphics grDevices datasets utils methods base >>>>>>>>>>> >>>>>>>>>>> other attached packages: >>>>>>>>>>> [1] GenomicRanges_1.11.0 IRanges_1.17.0 BiocGenerics_0.5.0 >>>>>>>>>>> >>>>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>>>> [1] parallel_2.16.0 stats4_2.16.0 tools_2.16.0 >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Bioconductor mailing list >>>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Bioconductor mailing list >>>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Hervé Pagès >>>>>>>>> >>>>>>>>> Program in Computational Biology >>>>>>>>> Division of Public Health Sciences >>>>>>>>> Fred Hutchinson Cancer Research Center >>>>>>>>> 1100 Fairview Ave. N, M1-B514 >>>>>>>>> P.O. Box 19024 >>>>>>>>> Seattle, WA 98109-1024 >>>>>>>>> >>>>>>>>> E-mail: hpages at fhcrc.org >>>>>>>>> Phone: (206) 667-5791 >>>>>>>>> Fax: (206) 667-1319 >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> *A model is a lie that helps you see the truth.* >>>>>>> * >>>>>>> * >>>>>>> Howard >>>>>>> Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >>>>>>> >>>>>>> [[alternative HTML version deleted]] >>>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >
ADD REPLY

Login before adding your answer.

Traffic: 763 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6