Question: nearest without overlapping
0
gravatar for endrebak85
10 months ago by
endrebak8530
endrebak8530 wrote:

How I find distanceToNearest without overlapping ranges? I cannot see a flag for it in the docs. Any plans to add this?

genomicranges • 321 views
ADD COMMENTlink modified 10 months ago by Hervé Pagès ♦♦ 13k • written 10 months ago by endrebak8530
Answer: nearest without overlapping
2
gravatar for Hervé Pagès
10 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi,

Don't remove anything from subject **before** you call distanceToNearest() or you will loose hits that you want to keep:

library(GenomicRanges)
x <- GRanges(c("chr1:2-4", "chr1:5-8"))
subject <- GRanges("chr1:7-10")

distanceToNearest(x, subject)
# Hits object with 2 hits and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         2
#   [2]         2           1 |         0
#   -------
#   queryLength: 2 / subjectLength: 1

overlaps <- subjectHits(findOverlaps(x, subject))
distanceToNearest(x, subject[-overlaps])
# Hits object with 0 hits and 1 metadata column:
#    queryHits subjectHits |  distance
#    <integer>   <integer> | <integer>
#   -------
#   queryLength: 2 / subjectLength: 0

Instead, remove hits **after** calling distanceToNearest():

hits <- distanceToNearest(x, subject)
code <- pcompare(x[queryHits(hits)], subject[subjectHits(hits)])
hits[abs(code) > 4]
# Hits object with 1 hit and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         2
#   -------
#   queryLength: 2 / subjectLength: 1

See ?IRanges::pcompare for the meaning of the overlapping codes.

Would be nice to have this as a flag in distanceToNearest().

H

ADD COMMENTlink modified 10 months ago • written 10 months ago by Hervé Pagès ♦♦ 13k

This will never find the nearest-but-not-overlapping hit for a query if there is any overlap in the subject, right? The same limitation as my solution, but more efficient. And there is poverlaps() for this. 

ADD REPLYlink written 10 months ago by Michael Lawrence11k

I see. Yes finding the nearest-but-not-overlapping ranges requires a little bit more work. One way to go is to take advantage of the fact that precede() and nearest() both exclude overlapping ranges:

### A version of distanceToNearest() that finds all the
### nearest-but-not-overlapping ranges in 'subject' for each range
### in 'x':
distanceToNearest2 <- function(x, subject)
{
    hits <- union(precede(x, subject, select="all"),
                  follow(x, subject, select="all"))
    mcols(hits)$distance <- distance(x[queryHits(hits)],
                                     subject[subjectHits(hits)])
    x2subject <- as(hits, "IntegerList")
    x2d <- relist(mcols(hits)$distance, x2subject)
    hits[unlist(min(x2d) == x2d)]
}

Then:

library(GenomicRanges)
x <- GRanges(c("chr1:2-4", "chr1:5-8", "chr1:12-13",
               "chr1:10-15", "chr1:16-17","chr1:20-19"))
subject <- GRanges(c("chr1:8-10", "chr1:15-16", "chr1:15-15"))
distanceToNearest2(x, subject)
# Hits object with 8 hits and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         3
#   [2]         2           2 |         6
#   [3]         2           3 |         6
#   [4]         3           2 |         1
#   [5]         3           3 |         1
#   [6]         3           1 |         1
#   [7]         5           3 |         0
#   [8]         6           2 |         3
#   -------
#   queryLength: 6 / subjectLength: 3

H.

ADD REPLYlink modified 10 months ago • written 10 months ago by Hervé Pagès ♦♦ 13k

This part of the code

    x2subject <- as(hits, "IntegerList")
    x2d <- relist(mcols(hits)$distance, x2subject)
    hits[unlist(min(x2d) == x2d)]

was to me highly idiosyncratic (in retrospect I guess there was only one thing that I didn't recognize immediately, which was the coercion of hits to IntegerList, but that was enough to send me off on the follow...). I started with a 'base R' solution (using 'query' instead of 'x' for consistency with the argument names of distanceToNearest()). Here's the distance between each 'nearest' query  / subject pair

    distance <- distance(query[queryHits(hits)], subject[subjectHits(hits)])

The challenge is to apply a function (which values are the minimum?) to the distances, grouped by query hit. If I didn't have any groups, I could use fun1() to return a logical vector indicating the values that are minima 

    fun1 <- function(x)
        min(x) == x

To apply this function to grouped values, a base R solution is to use the function ave()

    keep <- as.logical(ave(distance, queryHits(hits), FUN = fun1)

and then to clean up results

    ## clean-up
    mcols(hits)$distance <- distance
    hits[keep]

distanceToNearest() doesn't actually return ties, so I modified the function used in ave() to

    fun2 <- function(x)
        min(x) == x & !duplicated(x)

I think the performance of ave() wouldn't be too bad for a typical script, and the overall code seems to me easier to understand (ave() is probably relatively specialized knowledge, but the problem of operating on grouped values in base R comes up quite frequently on StackOverflow, so once one recognizes the problem being faced -- apply a vectorized function to a vector split into groups -- a solution isn't too far away).

One bag of tricks I do know about S4Vectors is that a list-of-integers can be efficiently created, manipulated 'group-wise', and turned into a vector. So I replace the use of ave() with

    lst <- splitAsList(distance, queryHits(hits)
    keep <- unsplit(fun1(lst), queryHits(hits))

S4Vectors applies fun1() to each element of lst, i.e., group-wise. Maybe this isn't that much less idiosyncratic than Herve's -- my first line is Herve's lines 1 & 2, and the second line is the subset in Herve's line 3. But knowing the splitAsList() / unlist(fun()) trick applies to many different situations, whereas as(hits, "IntegerList") is very specific. So my revision is

    distanceToNearest3 <- function(query, subject, fun = fun2) {
        hits <- union(
            precede(query, subject, select="all"),
            follow(query, subject, select="all")
        )
        ## minimum distance by query
        distance <- distance(query[queryHits(hits)], subject[subjectHits(hits)])
        lst <- splitAsList(distance, queryHits(hits))
        keep <- unsplit(fun(lst), queryHits(hits))
        ## clean-up
        mcols(hits)$distance <- distance
        hits[keep]
    }
ADD REPLYlink modified 10 months ago • written 10 months ago by Martin Morgan ♦♦ 23k

Note that the relist/unlist and split/unsplit combos are elegant, powerful, and reliable idioms, but the split/unlist combo is a risky one. It only works if the splitting factor is sorted which is indeed the case here because the class of Hits object hits is actually SortedByQueryHits.

And because hits is a SortedByQueryHits object, relist/unlist can be used instead of split/unsplit or split/unlist. The cost of relist(distance, CompressedIntegerList) is virtually zero whereas splitAsList(distance, f) generally has to pay the cost of sorting the split factor or at least checking that the split factor is already sorted.

Another advantage of relist(distance, skeleton) is that it is guaranteed to return a List derivative that is parallel to skeleton when skeleton is itself a List derivative. splitAsList(distance, f) OTOH returns a List derivative that is parallel to sort(unique(f)) when f is an atomic vector. This means in particular that queries not represented in queryHits(hits) are not represented in splitAsList(distance, queryHits(hits)) either, which could be problematic in some contexts.

H.

ADD REPLYlink modified 10 months ago • written 10 months ago by Hervé Pagès ♦♦ 13k

Thanks for reminding me (again) of the split / unlist problems; I updated my comment to use split / unsplit.

ADD REPLYlink modified 10 months ago • written 10 months ago by Martin Morgan ♦♦ 23k

Wouldn't the code be faster with

        precede(query, subject, select="first"),
        follow(query, subject, select="last")

?

The only problem is that the code errors:

Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function 'from' for signature '"integer"'
Calls: distanceToNearest3 -> distance -> [ -> queryHits -> from -> <Anonymous>

I am comparing genomicranges with pyranges for a preprint I am writing and I am trying to optimize the competing libraries as much as possible.

ADD REPLYlink modified 10 months ago • written 10 months ago by endrebak8530
Answer: nearest without overlapping
0
gravatar for Valerie Obenchain
10 months ago by
United States
Valerie Obenchain6.7k wrote:

You could filter out the overlapping ranges before calling distanceToNearest():

> query <- GRanges("chr1", IRanges(5, width=6))
> subject <- GRanges("chr1", IRanges(start=c(15, 10, 20), width=5))
> overlaps <- subjectHits(findOverlaps(query, subject))
> distanceToNearest(query, subject[-overlaps])
Hits object with 1 hit and 1 metadata column:
      queryHits subjectHits |  distance
      <integer>   <integer> | <integer>
  [1]         1           1 |         4
  -------
  queryLength: 1 / subjectLength: 2

Valerie

ADD COMMENTlink written 10 months ago by Valerie Obenchain6.7k

The above will remove all subject ranges with any overlap in the query, so there may be cases where a nearest match is not found, since the subject range overlapped a different query range. Instead, use set arithmetic on the hits objects, like

setdiff(distanceToNearest(query, subject), findOverlaps(query, subject))
ADD REPLYlink written 10 months ago by Michael Lawrence11k

Doesn't that leave some subjects (or is it queries?) without nearest queries (or is it subjects?)?

ADD REPLYlink written 10 months ago by Martin Morgan ♦♦ 23k

Yes, that's right, if a query overlaps a subject range, no hits for it are returned. Depending on exactly what is needed here, there may not be a good solution. The nearest algorithm, as I recall, is greedy and based on positions, so it would need to be rewritten to support this mode. Perhaps the asker could describe their concrete use case. The bedtools closest -io command is documented to behave like Val's solution, where subject ranges with any overlap in the query are ignored.

ADD REPLYlink written 10 months ago by Michael Lawrence11k

To remove overlaps, it's also possible to do this:

distanceToNearest(query, subsetByOverlaps(subject, query, invert=TRUE))

ADD REPLYlink written 10 months ago by Michael Lawrence11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 202 users visited in the last hour