Search
Question: Problem with IRanges and GenomicRanges
0
gravatar for spollack
11 days ago by
spollack0
Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston
spollack0 wrote:

Dear Bioconductor,

I think it is possible that recent modifications to IRanges and GenomicRanges broke our bumphunter package.

bumphunter does a very simple test that amounts to this:

R> x <- data.frame(start = c(2, 4, 7, 10, 13, 17, 19, 22, 24), end = c(3, 5, 9, 12, 16, 18, 20, 23, 26), chr = 'chr1')
R> subject <- data.frame(start = c(1, 4, 7, 11, 14, 21, 25), 
end = c(3, 6, 9, 12, 15, 22, 27), chr = 'chr1')
grx <- makeGRangesFromDataFrame(x)
grs <- makeGRangesFromDataFrame(subject)
IRanges::nearest(ranges(grx), ranges(grs)) 

This is the result with GenomicRanges 1.29.12 and IRanges 2.11.12 - looks right - [4,5] is in [4,6]

> IRanges::nearest(ranges(grx), ranges(grs))
[1] 1 2 3 4 5 5 6 6 7

This is the result with GenomicRanges 1.29.13 and IRanges 2.11.15 - looks wrong - [4,5] is in [2,3]

> IRanges::nearest(ranges(grx), ranges(grs))
[1] 1 1 2 3 4 5 6 6 7

More output for clarification:

> ranges(grx)
IRanges object with 9 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         2         3         2
  [2]         4         5         2
  [3]         7         9         3
  [4]        10        12         3
  [5]        13        16         4
  [6]        17        18         2
  [7]        19        20         2
  [8]        22        23         2
  [9]        24        26         3
> ranges(grs)
IRanges object with 7 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         3         3
  [2]         4         6         3
  [3]         7         9         3
  [4]        11        12         2
  [5]        14        15         2
  [6]        21        22         2
  [7]        25        27         3

Can you help us, please? Our package is flunking build on Bioconductor 3.5 and 3.6.

thanks,

- Sam

ADD COMMENTlink modified 11 days ago by Hervé Pagès ♦♦ 12k • written 11 days ago by spollack0
0
gravatar for Hervé Pagès
11 days ago by
Hervé Pagès ♦♦ 12k
United States
Hervé Pagès ♦♦ 12k wrote:

Hi Sam,

Strictly speaking nearest() is about distance, not about overlaps or about one interval being in the other. In terms of distance [4,5] is as close to [4,6] as to [2,3]. In both cases, the distance is 0 as reported by distanceToNearest():

x <- IRanges(4, 5)
subject <- IRanges(c(4, 2), c(6, 3))
x
# IRanges object with 1 range and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         4         5         2
subject
# IRanges object with 2 ranges and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         4         6         3
#   [2]         2         3         2
distanceToNearest(x, subject, select="all")
# Hits object with 2 hits and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         0
#   [2]         1           2 |         0
#   -------
#   queryLength: 1 / subjectLength: 2

So nearest() also reports 2 nearest ranges:

nearest(x, subject, select="all")
# Hits object with 2 hits and 0 metadata columns:
#       queryHits subjectHits
#       <integer>   <integer>
#   [1]         1           1
#   [2]         1           2
#   -------
#   queryLength: 1 / subjectLength: 2

It's important that nearest() and distanceToNearest() behave consistently with the underlying distance between ranges. This is why it was fixed recently. Before the fix, the above calls were reporting only one range.

When you don't specify select="all" and there is more than one nearest range, unfortunately nearest() and distanceToNearest() pick up one arbitrarily, and not necessarily the one you would have picked up intuitively. One could fairly argue that they should be able to do a better pick-up job by default. Like this modified version of nearest() (it picks up the range with most overlap when there is more than one nearest range):

poverlapWidth <- function(query, subject)
{
    overlap_score <- pmin(end(query), end(subject)) -
                     pmax(start(query), start(subject)) + 1L
    pmax(overlap_score, 0L)
}

smarterNearest <- function(x, subject)
{
    hits <- nearest(x, subject, select="all")
    mcols(hits)$ovwidth <- poverlapWidth(x[queryHits(hits)],
                                         subject[subjectHits(hits)])
    partitioning <- PartitioningByEnd(queryHits(hits), NG=queryLength(hits))
    ovwidth <- relist(mcols(hits)$ovwidth, partitioning)
    hits <- hits[unlist(ovwidth == max(ovwidth))]
    selectHits(hits, select="arbitrary")
}

We'll need to work on something like this...

In the mean time you can put it and use it in your package. It should do the right thing on IRanges objects. However, to make it work properly on GRanges objects, you'll need to modify poverlapWidth() to make it return 0 when the query and subject are not on the same chromosome or strand.

FWIW I'm actually planning to add something like poverlapWidth() to the IRanges and GenomicRanges packages (would be a generic with methods for IRanges and GRanges objects).

Cheers,

H.

ADD COMMENTlink modified 11 days ago • written 11 days ago by Hervé Pagès ♦♦ 12k

Confused. How can the distance between [4,5] and [2,3] be zero? The documentation for nearest-methods {IRanges} says:

  • x and y overlap => distance(x, y) == 0

    or

    x and y overlap or are adjacent <=> distance(x, y) == 0

 [4,5] and [2,3] do not overlap and they are not adjacent.

What is the definition of distance( [x,y], [u,v] )?

thanks,

- Sam

ADD REPLYlink written 11 days ago by spollack0

Aha, now I get it. These aren't points on the real line; they're segments. 

|--1--|--2--|--3--|--4--|--5--|--6--|

So |--2--|--3--| is adjacent to |--4--|--5--|

 

ADD REPLYlink written 11 days ago by spollack0

Exactly. This is why the width of a range with a single point is 1, not 0. I call it the "Lego-block model". All the ranges operations should behave according to that model.

H.

ADD REPLYlink written 11 days ago by Hervé Pagès ♦♦ 12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 271 users visited in the last hour