How I find distanceToNearest without overlapping ranges? I cannot see a flag for it in the docs. Any plans to add this?
How I find distanceToNearest without overlapping ranges? I cannot see a flag for it in the docs. Any plans to add this?
Hi,
Don't remove anything from subject **before** you call distanceToNearest() or you will loose hits that you want to keep:
library(GenomicRanges)
x <- GRanges(c("chr1:2-4", "chr1:5-8"))
subject <- GRanges("chr1:7-10")
distanceToNearest(x, subject)
# Hits object with 2 hits and 1 metadata column:
# queryHits subjectHits | distance
# <integer> <integer> | <integer>
# [1] 1 1 | 2
# [2] 2 1 | 0
# -------
# queryLength: 2 / subjectLength: 1
overlaps <- subjectHits(findOverlaps(x, subject))
distanceToNearest(x, subject[-overlaps])
# Hits object with 0 hits and 1 metadata column:
# queryHits subjectHits | distance
# <integer> <integer> | <integer>
# -------
# queryLength: 2 / subjectLength: 0
Instead, remove hits **after** calling distanceToNearest():
hits <- distanceToNearest(x, subject) code <- pcompare(x[queryHits(hits)], subject[subjectHits(hits)]) hits[abs(code) > 4] # Hits object with 1 hit and 1 metadata column: # queryHits subjectHits | distance # <integer> <integer> | <integer> # [1] 1 1 | 2 # ------- # queryLength: 2 / subjectLength: 1
See ?IRanges::pcompare for the meaning of the overlapping codes.
Would be nice to have this as a flag in distanceToNearest().
H
You could filter out the overlapping ranges before calling distanceToNearest():
> query <- GRanges("chr1", IRanges(5, width=6))
> subject <- GRanges("chr1", IRanges(start=c(15, 10, 20), width=5))
> overlaps <- subjectHits(findOverlaps(query, subject))
> distanceToNearest(query, subject[-overlaps])
Hits object with 1 hit and 1 metadata column:
queryHits subjectHits | distance
<integer> <integer> | <integer>
[1] 1 1 | 4
-------
queryLength: 1 / subjectLength: 2
Valerie
The above will remove all subject ranges with any overlap in the query, so there may be cases where a nearest match is not found, since the subject range overlapped a different query range. Instead, use set arithmetic on the hits objects, like
setdiff(distanceToNearest(query, subject), findOverlaps(query, subject))
Yes, that's right, if a query overlaps a subject range, no hits for it are returned. Depending on exactly what is needed here, there may not be a good solution. The nearest algorithm, as I recall, is greedy and based on positions, so it would need to be rewritten to support this mode. Perhaps the asker could describe their concrete use case. The bedtools closest -io command is documented to behave like Val's solution, where subject ranges with any overlap in the query are ignored.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
This will never find the nearest-but-not-overlapping hit for a query if there is any overlap in the subject, right? The same limitation as my solution, but more efficient. And there is
poverlaps()for this.I see. Yes finding the nearest-but-not-overlapping ranges requires a little bit more work. One way to go is to take advantage of the fact that
precede()andnearest()both exclude overlapping ranges:### A version of distanceToNearest() that finds all the ### nearest-but-not-overlapping ranges in 'subject' for each range ### in 'x': distanceToNearest2 <- function(x, subject) { hits <- union(precede(x, subject, select="all"), follow(x, subject, select="all")) mcols(hits)$distance <- distance(x[queryHits(hits)], subject[subjectHits(hits)]) x2subject <- as(hits, "IntegerList") x2d <- relist(mcols(hits)$distance, x2subject) hits[unlist(min(x2d) == x2d)] }Then:
library(GenomicRanges) x <- GRanges(c("chr1:2-4", "chr1:5-8", "chr1:12-13", "chr1:10-15", "chr1:16-17","chr1:20-19")) subject <- GRanges(c("chr1:8-10", "chr1:15-16", "chr1:15-15")) distanceToNearest2(x, subject) # Hits object with 8 hits and 1 metadata column: # queryHits subjectHits | distance # <integer> <integer> | <integer> # [1] 1 1 | 3 # [2] 2 2 | 6 # [3] 2 3 | 6 # [4] 3 2 | 1 # [5] 3 3 | 1 # [6] 3 1 | 1 # [7] 5 3 | 0 # [8] 6 2 | 3 # ------- # queryLength: 6 / subjectLength: 3H.
This part of the code
x2subject <- as(hits, "IntegerList") x2d <- relist(mcols(hits)$distance, x2subject) hits[unlist(min(x2d) == x2d)]was to me highly idiosyncratic (in retrospect I guess there was only one thing that I didn't recognize immediately, which was the coercion of
hitstoIntegerList, but that was enough to send me off on the follow...). I started with a 'base R' solution (using 'query' instead of 'x' for consistency with the argument names ofdistanceToNearest()). Here's the distance between each 'nearest' query / subject pairThe challenge is to apply a function (which values are the minimum?) to the distances, grouped by query hit. If I didn't have any groups, I could use
fun1()to return a logical vector indicating the values that are minimafun1 <- function(x) min(x) == xTo apply this function to grouped values, a base R solution is to use the function ave()
and then to clean up results
## clean-up mcols(hits)$distance <- distance hits[keep]distanceToNearest()doesn't actually return ties, so I modified the function used inave()tofun2 <- function(x) min(x) == x & !duplicated(x)I think the performance of
ave()wouldn't be too bad for a typical script, and the overall code seems to me easier to understand (ave() is probably relatively specialized knowledge, but the problem of operating on grouped values in base R comes up quite frequently on StackOverflow, so once one recognizes the problem being faced -- apply a vectorized function to a vector split into groups -- a solution isn't too far away).One bag of tricks I do know about S4Vectors is that a list-of-integers can be efficiently created, manipulated 'group-wise', and turned into a vector. So I replace the use of
ave()withlst <- splitAsList(distance, queryHits(hits) keep <- unsplit(fun1(lst), queryHits(hits))S4Vectors applies
fun1()to each element oflst, i.e., group-wise. Maybe this isn't that much less idiosyncratic than Herve's -- my first line is Herve's lines 1 & 2, and the second line is the subset in Herve's line 3. But knowing thesplitAsList()/unlist(fun())trick applies to many different situations, whereasas(hits, "IntegerList")is very specific. So my revision isdistanceToNearest3 <- function(query, subject, fun = fun2) { hits <- union( precede(query, subject, select="all"), follow(query, subject, select="all") ) ## minimum distance by query distance <- distance(query[queryHits(hits)], subject[subjectHits(hits)]) lst <- splitAsList(distance, queryHits(hits)) keep <- unsplit(fun(lst), queryHits(hits)) ## clean-up mcols(hits)$distance <- distance hits[keep] }Note that the
relist/unlistandsplit/unsplitcombos are elegant, powerful, and reliable idioms, but thesplit/unlistcombo is a risky one. It only works if the splitting factor is sorted which is indeed the case here because the class of Hits objecthitsis actually SortedByQueryHits.And because
hitsis a SortedByQueryHits object,relist/unlistcan be used instead ofsplit/unsplitorsplit/unlist.The cost ofrelist(distance, CompressedIntegerList)is virtually zero whereassplitAsList(distance, f)generally has to pay the cost of sorting the split factor or at least checking that the split factor is already sorted.Another advantage of
relist(distance, skeleton)is that it is guaranteed to return a List derivative that is parallel toskeletonwhenskeletonis itself a List derivative.splitAsList(distance, f)OTOH returns a List derivative that is parallel tosort(unique(f))whenfis an atomic vector. This means in particular that queries not represented inqueryHits(hits)are not represented insplitAsList(distance, queryHits(hits))either, which could be problematic in some contexts.H.
Thanks for reminding me (again) of the split / unlist problems; I updated my comment to use split / unsplit.
Wouldn't the code be faster with
precede(query, subject, select="first"), follow(query, subject, select="last")?
The only problem is that the code errors:
I am comparing genomicranges with pyranges for a preprint I am writing and I am trying to optimize the competing libraries as much as possible.