Search
Question: Problem with IRanges and GenomicRanges
0
13 months ago by
spollack0
Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston
spollack0 wrote:

Dear Bioconductor,

I think it is possible that recent modifications to IRanges and GenomicRanges broke our bumphunter package.

bumphunter does a very simple test that amounts to this:

R> x <- data.frame(start = c(2, 4, 7, 10, 13, 17, 19, 22, 24), end = c(3, 5, 9, 12, 16, 18, 20, 23, 26), chr = 'chr1')
R> subject <- data.frame(start = c(1, 4, 7, 11, 14, 21, 25),
end = c(3, 6, 9, 12, 15, 22, 27), chr = 'chr1')
grx <- makeGRangesFromDataFrame(x)
grs <- makeGRangesFromDataFrame(subject)
IRanges::nearest(ranges(grx), ranges(grs)) 

This is the result with GenomicRanges 1.29.12 and IRanges 2.11.12 - looks right - [4,5] is in [4,6]

> IRanges::nearest(ranges(grx), ranges(grs))
[1] 1 2 3 4 5 5 6 6 7

This is the result with GenomicRanges 1.29.13 and IRanges 2.11.15 - looks wrong - [4,5] is in [2,3]

> IRanges::nearest(ranges(grx), ranges(grs))
[1] 1 1 2 3 4 5 6 6 7

More output for clarification:

> ranges(grx)
IRanges object with 9 ranges and 0 metadata columns:
start       end     width
<integer> <integer> <integer>
[1]         2         3         2
[2]         4         5         2
[3]         7         9         3
[4]        10        12         3
[5]        13        16         4
[6]        17        18         2
[7]        19        20         2
[8]        22        23         2
[9]        24        26         3
> ranges(grs)
IRanges object with 7 ranges and 0 metadata columns:
start       end     width
<integer> <integer> <integer>
[1]         1         3         3
[2]         4         6         3
[3]         7         9         3
[4]        11        12         2
[5]        14        15         2
[6]        21        22         2
[7]        25        27         3

Can you help us, please? Our package is flunking build on Bioconductor 3.5 and 3.6.

thanks,

- Sam

modified 13 months ago by Hervé Pagès ♦♦ 13k • written 13 months ago by spollack0
0
13 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Sam,

Strictly speaking nearest() is about distance, not about overlaps or about one interval being in the other. In terms of distance [4,5] is as close to [4,6] as to [2,3]. In both cases, the distance is 0 as reported by distanceToNearest():

x <- IRanges(4, 5)
subject <- IRanges(c(4, 2), c(6, 3))
x
# IRanges object with 1 range and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         4         5         2
subject
# IRanges object with 2 ranges and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         4         6         3
#   [2]         2         3         2
distanceToNearest(x, subject, select="all")
# Hits object with 2 hits and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         0
#   [2]         1           2 |         0
#   -------
#   queryLength: 1 / subjectLength: 2

So nearest() also reports 2 nearest ranges:

nearest(x, subject, select="all")
# Hits object with 2 hits and 0 metadata columns:
#       queryHits subjectHits
#       <integer>   <integer>
#   [1]         1           1
#   [2]         1           2
#   -------
#   queryLength: 1 / subjectLength: 2


It's important that nearest() and distanceToNearest() behave consistently with the underlying distance between ranges. This is why it was fixed recently. Before the fix, the above calls were reporting only one range.

When you don't specify select="all" and there is more than one nearest range, unfortunately nearest() and distanceToNearest() pick up one arbitrarily, and not necessarily the one you would have picked up intuitively. One could fairly argue that they should be able to do a better pick-up job by default. Like this modified version of nearest() (it picks up the range with most overlap when there is more than one nearest range):

poverlapWidth <- function(query, subject)
{
overlap_score <- pmin(end(query), end(subject)) -
pmax(start(query), start(subject)) + 1L
pmax(overlap_score, 0L)
}

smarterNearest <- function(x, subject)
{
hits <- nearest(x, subject, select="all")
mcols(hits)$ovwidth <- poverlapWidth(x[queryHits(hits)], subject[subjectHits(hits)]) partitioning <- PartitioningByEnd(queryHits(hits), NG=queryLength(hits)) ovwidth <- relist(mcols(hits)$ovwidth, partitioning)
hits <- hits[unlist(ovwidth == max(ovwidth))]
selectHits(hits, select="arbitrary")
}

We'll need to work on something like this...

In the mean time you can put it and use it in your package. It should do the right thing on IRanges objects. However, to make it work properly on GRanges objects, you'll need to modify poverlapWidth() to make it return 0 when the query and subject are not on the same chromosome or strand.

FWIW I'm actually planning to add something like poverlapWidth() to the IRanges and GenomicRanges packages (would be a generic with methods for IRanges and GRanges objects).

Cheers,

H.

Confused. How can the distance between [4,5] and [2,3] be zero? The documentation for nearest-methods {IRanges} says:

• x and y overlap => distance(x, y) == 0

or

x and y overlap or are adjacent <=> distance(x, y) == 0

[4,5] and [2,3] do not overlap and they are not adjacent.

What is the definition of distance( [x,y], [u,v] )?

thanks,

- Sam

Aha, now I get it. These aren't points on the real line; they're segments.

|--1--|--2--|--3--|--4--|--5--|--6--|

So |--2--|--3--| is adjacent to |--4--|--5--|

Exactly. This is why the width of a range with a single point is 1, not 0. I call it the "Lego-block model". All the ranges operations should behave according to that model.

H.