Search
Question: nearest without overlapping
0
5 months ago by
endrebak8530
endrebak8530 wrote:

How I find distanceToNearest without overlapping ranges? I cannot see a flag for it in the docs. Any plans to add this?

modified 5 months ago by Hervé Pagès ♦♦ 13k • written 5 months ago by endrebak8530
2
5 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi,

Don't remove anything from subject **before** you call distanceToNearest() or you will loose hits that you want to keep:

library(GenomicRanges)
x <- GRanges(c("chr1:2-4", "chr1:5-8"))
subject <- GRanges("chr1:7-10")

distanceToNearest(x, subject)
# Hits object with 2 hits and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         2
#   [2]         2           1 |         0
#   -------
#   queryLength: 2 / subjectLength: 1

overlaps <- subjectHits(findOverlaps(x, subject))
distanceToNearest(x, subject[-overlaps])
# Hits object with 0 hits and 1 metadata column:
#    queryHits subjectHits |  distance
#    <integer>   <integer> | <integer>
#   -------
#   queryLength: 2 / subjectLength: 0


Instead, remove hits **after** calling distanceToNearest():

hits <- distanceToNearest(x, subject)
code <- pcompare(x[queryHits(hits)], subject[subjectHits(hits)])
hits[abs(code) > 4]
# Hits object with 1 hit and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         2
#   -------
#   queryLength: 2 / subjectLength: 1

See ?IRanges::pcompare for the meaning of the overlapping codes.

Would be nice to have this as a flag in distanceToNearest().

H

This will never find the nearest-but-not-overlapping hit for a query if there is any overlap in the subject, right? The same limitation as my solution, but more efficient. And there is poverlaps() for this.

I see. Yes finding the nearest-but-not-overlapping ranges requires a little bit more work. One way to go is to take advantage of the fact that precede() and nearest() both exclude overlapping ranges:

### A version of distanceToNearest() that finds all the
### nearest-but-not-overlapping ranges in 'subject' for each range
### in 'x':
distanceToNearest2 <- function(x, subject)
{
hits <- union(precede(x, subject, select="all"),
follow(x, subject, select="all"))
mcols(hits)$distance <- distance(x[queryHits(hits)], subject[subjectHits(hits)]) x2subject <- as(hits, "IntegerList") x2d <- relist(mcols(hits)$distance, x2subject)
hits[unlist(min(x2d) == x2d)]
}

Then:

library(GenomicRanges)
x <- GRanges(c("chr1:2-4", "chr1:5-8", "chr1:12-13",
"chr1:10-15", "chr1:16-17","chr1:20-19"))
subject <- GRanges(c("chr1:8-10", "chr1:15-16", "chr1:15-15"))
distanceToNearest2(x, subject)
# Hits object with 8 hits and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         3
#   [2]         2           2 |         6
#   [3]         2           3 |         6
#   [4]         3           2 |         1
#   [5]         3           3 |         1
#   [6]         3           1 |         1
#   [7]         5           3 |         0
#   [8]         6           2 |         3
#   -------
#   queryLength: 6 / subjectLength: 3

H.

This part of the code

    x2subject <- as(hits, "IntegerList")
x2d <- relist(mcols(hits)$distance, x2subject) hits[unlist(min(x2d) == x2d)] was to me highly idiosyncratic (in retrospect I guess there was only one thing that I didn't recognize immediately, which was the coercion of hits to IntegerList, but that was enough to send me off on the follow...). I started with a 'base R' solution (using 'query' instead of 'x' for consistency with the argument names of distanceToNearest()). Here's the distance between each 'nearest' query / subject pair  distance <- distance(query[queryHits(hits)], subject[subjectHits(hits)]) The challenge is to apply a function (which values are the minimum?) to the distances, grouped by query hit. If I didn't have any groups, I could use fun1() to return a logical vector indicating the values that are minima  fun1 <- function(x) min(x) == x To apply this function to grouped values, a base R solution is to use the function ave()  keep <- as.logical(ave(distance, queryHits(hits), FUN = fun1) and then to clean up results  ## clean-up mcols(hits)$distance <- distance
hits[keep]

distanceToNearest() doesn't actually return ties, so I modified the function used in ave() to

    fun2 <- function(x)
min(x) == x & !duplicated(x)

I think the performance of ave() wouldn't be too bad for a typical script, and the overall code seems to me easier to understand (ave() is probably relatively specialized knowledge, but the problem of operating on grouped values in base R comes up quite frequently on StackOverflow, so once one recognizes the problem being faced -- apply a vectorized function to a vector split into groups -- a solution isn't too far away).

One bag of tricks I do know about S4Vectors is that a list-of-integers can be efficiently created, manipulated 'group-wise', and turned into a vector. So I replace the use of ave() with

    lst <- splitAsList(distance, queryHits(hits)
keep <- unsplit(fun1(lst), queryHits(hits))

S4Vectors applies fun1() to each element of lst, i.e., group-wise. Maybe this isn't that much less idiosyncratic than Herve's -- my first line is Herve's lines 1 & 2, and the second line is the subset in Herve's line 3. But knowing the splitAsList() / unlist(fun()) trick applies to many different situations, whereas as(hits, "IntegerList") is very specific. So my revision is

    distanceToNearest3 <- function(query, subject, fun = fun2) {
hits <- union(
precede(query, subject, select="all"),
follow(query, subject, select="all")
)
## minimum distance by query
distance <- distance(query[queryHits(hits)], subject[subjectHits(hits)])
lst <- splitAsList(distance, queryHits(hits))
keep <- unsplit(fun(lst), queryHits(hits))
## clean-up
mcols(hits)\$distance <- distance
hits[keep]
}

Note that the relist/unlist and split/unsplit combos are elegant, powerful, and reliable idioms, but the split/unlist combo is a risky one. It only works if the splitting factor is sorted which is indeed the case here because the class of Hits object hits is actually SortedByQueryHits.

And because hits is a SortedByQueryHits object, relist/unlist can be used instead of split/unsplit or split/unlist. The cost of relist(distance, CompressedIntegerList) is virtually zero whereas splitAsList(distance, f) generally has to pay the cost of sorting the split factor or at least checking that the split factor is already sorted.

Another advantage of relist(distance, skeleton) is that it is guaranteed to return a List derivative that is parallel to skeleton when skeleton is itself a List derivative. splitAsList(distance, f) OTOH returns a List derivative that is parallel to sort(unique(f)) when f is an atomic vector. This means in particular that queries not represented in queryHits(hits) are not represented in splitAsList(distance, queryHits(hits)) either, which could be problematic in some contexts.

H.

Thanks for reminding me (again) of the split / unlist problems; I updated my comment to use split / unsplit.

Wouldn't the code be faster with

        precede(query, subject, select="first"),
follow(query, subject, select="last")

?

The only problem is that the code errors:

Error in (function (classes, fdef, mtable)  :
unable to find an inherited method for function 'from' for signature '"integer"'
Calls: distanceToNearest3 -> distance -> [ -> queryHits -> from -> <Anonymous>

I am comparing genomicranges with pyranges for a preprint I am writing and I am trying to optimize the competing libraries as much as possible.

0
5 months ago by
Valerie Obenchain ♦♦ 6.7k
United States
Valerie Obenchain ♦♦ 6.7k wrote:

You could filter out the overlapping ranges before calling distanceToNearest():

> query <- GRanges("chr1", IRanges(5, width=6))
> subject <- GRanges("chr1", IRanges(start=c(15, 10, 20), width=5))
> overlaps <- subjectHits(findOverlaps(query, subject))
> distanceToNearest(query, subject[-overlaps])
Hits object with 1 hit and 1 metadata column:
queryHits subjectHits |  distance
<integer>   <integer> | <integer>
[1]         1           1 |         4
-------
queryLength: 1 / subjectLength: 2

Valerie

The above will remove all subject ranges with any overlap in the query, so there may be cases where a nearest match is not found, since the subject range overlapped a different query range. Instead, use set arithmetic on the hits objects, like

setdiff(distanceToNearest(query, subject), findOverlaps(query, subject))

Doesn't that leave some subjects (or is it queries?) without nearest queries (or is it subjects?)?

Yes, that's right, if a query overlaps a subject range, no hits for it are returned. Depending on exactly what is needed here, there may not be a good solution. The nearest algorithm, as I recall, is greedy and based on positions, so it would need to be rewritten to support this mode. Perhaps the asker could describe their concrete use case. The bedtools closest -io command is documented to behave like Val's solution, where subject ranges with any overlap in the query are ignored.

To remove overlaps, it's also possible to do this:

distanceToNearest(query, subsetByOverlaps(subject, query, invert=TRUE))