How I find distanceToNearest without overlapping ranges? I cannot see a flag for it in the docs. Any plans to add this?
How I find distanceToNearest without overlapping ranges? I cannot see a flag for it in the docs. Any plans to add this?
Hi,
Don't remove anything from subject **before** you call distanceToNearest()
or you will loose hits that you want to keep:
library(GenomicRanges) x <- GRanges(c("chr1:2-4", "chr1:5-8")) subject <- GRanges("chr1:7-10") distanceToNearest(x, subject) # Hits object with 2 hits and 1 metadata column: # queryHits subjectHits | distance # <integer> <integer> | <integer> # [1] 1 1 | 2 # [2] 2 1 | 0 # ------- # queryLength: 2 / subjectLength: 1 overlaps <- subjectHits(findOverlaps(x, subject)) distanceToNearest(x, subject[-overlaps]) # Hits object with 0 hits and 1 metadata column: # queryHits subjectHits | distance # <integer> <integer> | <integer> # ------- # queryLength: 2 / subjectLength: 0
Instead, remove hits **after** calling distanceToNearest()
:
hits <- distanceToNearest(x, subject) code <- pcompare(x[queryHits(hits)], subject[subjectHits(hits)]) hits[abs(code) > 4] # Hits object with 1 hit and 1 metadata column: # queryHits subjectHits | distance # <integer> <integer> | <integer> # [1] 1 1 | 2 # ------- # queryLength: 2 / subjectLength: 1
See ?IRanges::pcompare
for the meaning of the overlapping codes.
Would be nice to have this as a flag in distanceToNearest()
.
H
You could filter out the overlapping ranges before calling distanceToNearest()
:
> query <- GRanges("chr1", IRanges(5, width=6)) > subject <- GRanges("chr1", IRanges(start=c(15, 10, 20), width=5)) > overlaps <- subjectHits(findOverlaps(query, subject)) > distanceToNearest(query, subject[-overlaps]) Hits object with 1 hit and 1 metadata column: queryHits subjectHits | distance <integer> <integer> | <integer> [1] 1 1 | 4 ------- queryLength: 1 / subjectLength: 2
Valerie
The above will remove all subject ranges with any overlap in the query, so there may be cases where a nearest match is not found, since the subject range overlapped a different query range. Instead, use set arithmetic on the hits objects, like
setdiff(distanceToNearest(query, subject), findOverlaps(query, subject))
Yes, that's right, if a query overlaps a subject range, no hits for it are returned. Depending on exactly what is needed here, there may not be a good solution. The nearest algorithm, as I recall, is greedy and based on positions, so it would need to be rewritten to support this mode. Perhaps the asker could describe their concrete use case. The bedtools closest -io
command is documented to behave like Val's solution, where subject ranges with any overlap in the query are ignored.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
This will never find the nearest-but-not-overlapping hit for a query if there is any overlap in the subject, right? The same limitation as my solution, but more efficient. And there is
poverlaps()
for this.I see. Yes finding the nearest-but-not-overlapping ranges requires a little bit more work. One way to go is to take advantage of the fact that
precede()
andnearest()
both exclude overlapping ranges:Then:
H.
This part of the code
was to me highly idiosyncratic (in retrospect I guess there was only one thing that I didn't recognize immediately, which was the coercion of
hits
toIntegerList
, but that was enough to send me off on the follow...). I started with a 'base R' solution (using 'query
' instead of 'x
' for consistency with the argument names ofdistanceToNearest()
). Here's the distance between each 'nearest' query / subject pairThe challenge is to apply a function (which values are the minimum?) to the distances, grouped by query hit. If I didn't have any groups, I could use
fun1()
to return a logical vector indicating the values that are minimaTo apply this function to grouped values, a base R solution is to use the function ave()
and then to clean up results
distanceToNearest()
doesn't actually return ties, so I modified the function used inave()
toI think the performance of
ave()
wouldn't be too bad for a typical script, and the overall code seems to me easier to understand (ave() is probably relatively specialized knowledge, but the problem of operating on grouped values in base R comes up quite frequently on StackOverflow, so once one recognizes the problem being faced -- apply a vectorized function to a vector split into groups -- a solution isn't too far away).One bag of tricks I do know about S4Vectors is that a list-of-integers can be efficiently created, manipulated 'group-wise', and turned into a vector. So I replace the use of
ave()
withS4Vectors applies
fun1()
to each element oflst
, i.e., group-wise. Maybe this isn't that much less idiosyncratic than Herve's -- my first line is Herve's lines 1 & 2, and the second line is the subset in Herve's line 3. But knowing thesplitAsList()
/unlist(fun())
trick applies to many different situations, whereasas(hits, "IntegerList")
is very specific. So my revision isNote that the
relist
/unlist
andsplit
/unsplit
combos are elegant, powerful, and reliable idioms, but thesplit
/unlist
combo is a risky one. It only works if the splitting factor is sorted which is indeed the case here because the class of Hits objecthits
is actually SortedByQueryHits.And because
hits
is a SortedByQueryHits object,relist
/unlist
can be used instead ofsplit
/unsplit
orsplit
/unlist.
The cost ofrelist(distance, CompressedIntegerList)
is virtually zero whereassplitAsList(distance, f)
generally has to pay the cost of sorting the split factor or at least checking that the split factor is already sorted.Another advantage of
relist(distance, skeleton)
is that it is guaranteed to return a List derivative that is parallel toskeleton
whenskeleton
is itself a List derivative.splitAsList(distance, f)
OTOH returns a List derivative that is parallel tosort(unique(f))
whenf
is an atomic vector. This means in particular that queries not represented inqueryHits(hits)
are not represented insplitAsList(distance, queryHits(hits))
either, which could be problematic in some contexts.H.
Thanks for reminding me (again) of the split / unlist problems; I updated my comment to use split / unsplit.
Wouldn't the code be faster with
?
The only problem is that the code errors:
I am comparing genomicranges with pyranges for a preprint I am writing and I am trying to optimize the competing libraries as much as possible.