Question

Problem with IRanges and GenomicRanges

0

Entering edit mode

spollack • 0

@spollack-12387

Last seen 5.5 years ago

Biostatistics and Computational Biology…

Dear Bioconductor,

I think it is possible that recent modifications to IRanges and GenomicRanges broke our bumphunter package.

bumphunter does a very simple test that amounts to this:

R> x <- data.frame(start = c(2, 4, 7, 10, 13, 17, 19, 22, 24), end = c(3, 5, 9, 12, 16, 18, 20, 23, 26), chr = 'chr1')
R> subject <- data.frame(start = c(1, 4, 7, 11, 14, 21, 25), 
end = c(3, 6, 9, 12, 15, 22, 27), chr = 'chr1')
grx <- makeGRangesFromDataFrame(x)
grs <- makeGRangesFromDataFrame(subject)
IRanges::nearest(ranges(grx), ranges(grs))

This is the result with GenomicRanges 1.29.12 and IRanges 2.11.12 - looks right - [4,5] is in [4,6]

> IRanges::nearest(ranges(grx), ranges(grs))
[1] 1 2 3 4 5 5 6 6 7

This is the result with GenomicRanges 1.29.13 and IRanges 2.11.15 - looks wrong - [4,5] is in [2,3]

> IRanges::nearest(ranges(grx), ranges(grs))
[1] 1 1 2 3 4 5 6 6 7

More output for clarification:

> ranges(grx)
IRanges object with 9 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         2         3         2
  [2]         4         5         2
  [3]         7         9         3
  [4]        10        12         3
  [5]        13        16         4
  [6]        17        18         2
  [7]        19        20         2
  [8]        22        23         2
  [9]        24        26         3
> ranges(grs)
IRanges object with 7 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         3         3
  [2]         4         6         3
  [3]         7         9         3
  [4]        11        12         2
  [5]        14        15         2
  [6]        21        22         2
  [7]        25        27         3

Can you help us, please? Our package is flunking build on Bioconductor 3.5 and 3.6.

thanks,

- Sam

iranges genomicranges • 1.2k views

ADD COMMENT • link updated 6.6 years ago by Hervé Pagès 16k • written 6.6 years ago by spollack • 0

score 0 · Answer 1 · 2017-09-13

Hi Sam,

Strictly speaking nearest() is about distance, not about overlaps or about one interval being in the other. In terms of distance [4,5] is as close to [4,6] as to [2,3]. In both cases, the distance is 0 as reported by distanceToNearest():

x <- IRanges(4, 5)
subject <- IRanges(c(4, 2), c(6, 3))
x
# IRanges object with 1 range and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         4         5         2
subject
# IRanges object with 2 ranges and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         4         6         3
#   [2]         2         3         2
distanceToNearest(x, subject, select="all")
# Hits object with 2 hits and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         0
#   [2]         1           2 |         0
#   -------
#   queryLength: 1 / subjectLength: 2

So nearest() also reports 2 nearest ranges:

nearest(x, subject, select="all")
# Hits object with 2 hits and 0 metadata columns:
#       queryHits subjectHits
#       <integer>   <integer>
#   [1]         1           1
#   [2]         1           2
#   -------
#   queryLength: 1 / subjectLength: 2

It's important that nearest() and distanceToNearest() behave consistently with the underlying distance between ranges. This is why it was fixed recently. Before the fix, the above calls were reporting only one range.

When you don't specify select="all" and there is more than one nearest range, unfortunately nearest() and distanceToNearest() pick up one arbitrarily, and not necessarily the one you would have picked up intuitively. One could fairly argue that they should be able to do a better pick-up job by default. Like this modified version of nearest() (it picks up the range with most overlap when there is more than one nearest range):

poverlapWidth <- function(query, subject)
{
    overlap_score <- pmin(end(query), end(subject)) -
                     pmax(start(query), start(subject)) + 1L
    pmax(overlap_score, 0L)
}

smarterNearest <- function(x, subject)
{
    hits <- nearest(x, subject, select="all")
    mcols(hits)$ovwidth <- poverlapWidth(x[queryHits(hits)],
                                         subject[subjectHits(hits)])
    partitioning <- PartitioningByEnd(queryHits(hits), NG=queryLength(hits))
    ovwidth <- relist(mcols(hits)$ovwidth, partitioning)
    hits <- hits[unlist(ovwidth == max(ovwidth))]
    selectHits(hits, select="arbitrary")
}

We'll need to work on something like this...

In the mean time you can put it and use it in your package. It should do the right thing on IRanges objects. However, to make it work properly on GRanges objects, you'll need to modify poverlapWidth() to make it return 0 when the query and subject are not on the same chromosome or strand.

FWIW I'm actually planning to add something like poverlapWidth() to the IRanges and GenomicRanges packages (would be a generic with methods for IRanges and GRanges objects).

Cheers,

H.