Problem with IRanges and GenomicRanges
1
0
Entering edit mode
spollack • 0
@spollack-12387
Last seen 5.5 years ago
Biostatistics and Computational Biology…

Dear Bioconductor,

I think it is possible that recent modifications to IRanges and GenomicRanges broke our bumphunter package.

bumphunter does a very simple test that amounts to this:

R> x <- data.frame(start = c(2, 4, 7, 10, 13, 17, 19, 22, 24), end = c(3, 5, 9, 12, 16, 18, 20, 23, 26), chr = 'chr1')
R> subject <- data.frame(start = c(1, 4, 7, 11, 14, 21, 25), 
end = c(3, 6, 9, 12, 15, 22, 27), chr = 'chr1')
grx <- makeGRangesFromDataFrame(x)
grs <- makeGRangesFromDataFrame(subject)
IRanges::nearest(ranges(grx), ranges(grs)) 

This is the result with GenomicRanges 1.29.12 and IRanges 2.11.12 - looks right - [4,5] is in [4,6]

> IRanges::nearest(ranges(grx), ranges(grs))
[1] 1 2 3 4 5 5 6 6 7

This is the result with GenomicRanges 1.29.13 and IRanges 2.11.15 - looks wrong - [4,5] is in [2,3]

> IRanges::nearest(ranges(grx), ranges(grs))
[1] 1 1 2 3 4 5 6 6 7

More output for clarification:

> ranges(grx)
IRanges object with 9 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         2         3         2
  [2]         4         5         2
  [3]         7         9         3
  [4]        10        12         3
  [5]        13        16         4
  [6]        17        18         2
  [7]        19        20         2
  [8]        22        23         2
  [9]        24        26         3
> ranges(grs)
IRanges object with 7 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         3         3
  [2]         4         6         3
  [3]         7         9         3
  [4]        11        12         2
  [5]        14        15         2
  [6]        21        22         2
  [7]        25        27         3

Can you help us, please? Our package is flunking build on Bioconductor 3.5 and 3.6.

thanks,

- Sam

iranges genomicranges • 1.2k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Hi Sam,

Strictly speaking nearest() is about distance, not about overlaps or about one interval being in the other. In terms of distance [4,5] is as close to [4,6] as to [2,3]. In both cases, the distance is 0 as reported by distanceToNearest():

x <- IRanges(4, 5)
subject <- IRanges(c(4, 2), c(6, 3))
x
# IRanges object with 1 range and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         4         5         2
subject
# IRanges object with 2 ranges and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         4         6         3
#   [2]         2         3         2
distanceToNearest(x, subject, select="all")
# Hits object with 2 hits and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         0
#   [2]         1           2 |         0
#   -------
#   queryLength: 1 / subjectLength: 2

So nearest() also reports 2 nearest ranges:

nearest(x, subject, select="all")
# Hits object with 2 hits and 0 metadata columns:
#       queryHits subjectHits
#       <integer>   <integer>
#   [1]         1           1
#   [2]         1           2
#   -------
#   queryLength: 1 / subjectLength: 2

It's important that nearest() and distanceToNearest() behave consistently with the underlying distance between ranges. This is why it was fixed recently. Before the fix, the above calls were reporting only one range.

When you don't specify select="all" and there is more than one nearest range, unfortunately nearest() and distanceToNearest() pick up one arbitrarily, and not necessarily the one you would have picked up intuitively. One could fairly argue that they should be able to do a better pick-up job by default. Like this modified version of nearest() (it picks up the range with most overlap when there is more than one nearest range):

poverlapWidth <- function(query, subject)
{
    overlap_score <- pmin(end(query), end(subject)) -
                     pmax(start(query), start(subject)) + 1L
    pmax(overlap_score, 0L)
}

smarterNearest <- function(x, subject)
{
    hits <- nearest(x, subject, select="all")
    mcols(hits)$ovwidth <- poverlapWidth(x[queryHits(hits)],
                                         subject[subjectHits(hits)])
    partitioning <- PartitioningByEnd(queryHits(hits), NG=queryLength(hits))
    ovwidth <- relist(mcols(hits)$ovwidth, partitioning)
    hits <- hits[unlist(ovwidth == max(ovwidth))]
    selectHits(hits, select="arbitrary")
}

We'll need to work on something like this...

In the mean time you can put it and use it in your package. It should do the right thing on IRanges objects. However, to make it work properly on GRanges objects, you'll need to modify poverlapWidth() to make it return 0 when the query and subject are not on the same chromosome or strand.

FWIW I'm actually planning to add something like poverlapWidth() to the IRanges and GenomicRanges packages (would be a generic with methods for IRanges and GRanges objects).

Cheers,

H.

ADD COMMENT
0
Entering edit mode

Confused. How can the distance between [4,5] and [2,3] be zero? The documentation for nearest-methods {IRanges} says:

  • x and y overlap => distance(x, y) == 0

    or

    x and y overlap or are adjacent <=> distance(x, y) == 0

 [4,5] and [2,3] do not overlap and they are not adjacent.

What is the definition of distance( [x,y], [u,v] )?

thanks,

- Sam

ADD REPLY
0
Entering edit mode

Aha, now I get it. These aren't points on the real line; they're segments. 

|--1--|--2--|--3--|--4--|--5--|--6--|

So |--2--|--3--| is adjacent to |--4--|--5--|

 

ADD REPLY
0
Entering edit mode

Exactly. This is why the width of a range with a single point is 1, not 0. I call it the "Lego-block model". All the ranges operations should behave according to that model.

H.

ADD REPLY

Login before adding your answer.

Traffic: 657 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6