I'm wondering if this is a feature or a bug, but when I try to subset GRanges object using a disjoint IRanges, I get a funny error. Any idea on how to solve this? Thanks!
> ir <- IRanges(start=c(1:5, 7:9), end=c(1:5, 7:9))
> ir
IRanges of length 8
start end width
[1] 1 1 1
[2] 2 2 1
[3] 3 3 1
[4] 4 4 1
[5] 5 5 1
[6] 7 7 1
[7] 8 8 1
[8] 9 9 1
> gr <- GRanges(seqnames="dummy", ranges=ir)
> gr
GRanges object with 8 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] dummy [1, 1] *
[2] dummy [2, 2] *
[3] dummy [3, 3] *
[4] dummy [4, 4] *
[5] dummy [5, 5] *
[6] dummy [7, 7] *
[7] dummy [8, 8] *
[8] dummy [9, 9] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> gr[ir]
Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
subscript contains out-of-bounds ranges
Even more strange result (this is clearly wrong):
> head(ir) IRanges of length 6 start end width [1] 1 1 1 [2] 2 2 1 [3] 3 3 1 [4] 4 4 1 [5] 5 5 1 [6] 7 7 1 > gr[head(ir)] GRanges object with 6 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] dummy [1, 1] * [2] dummy [2, 2] * [3] dummy [3, 3] * [4] dummy [4, 4] * [5] dummy [5, 5] * [6] dummy [8, 8] * ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths
> sessionInfo()
R version 3.2.2 Patched (2015-11-06 r69615)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE 13.2 (Harlequin) (x86_64)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GenomicRanges_1.22.1 GenomeInfoDb_1.6.1 IRanges_2.4.1
[4] S4Vectors_0.8.1 BiocGenerics_0.16.1 BiocInstaller_1.20.0
loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0 XVector_0.10.0 tools_3.2.2
Ok, thanks guys!
- this is counterintuitive indeed and perhaps a suggestion would be to somehow disallow it in order to avoid confusion. There is no point in accepting range objects as indices only to extract the start coordinates and use them as index. If someone wanted to do that (why use ranges then in the first place and not a vector?) better have them do it explicitly...
Or, if indexing is done by ranges, why not directly use subsetByOverlaps method?
K
It does not do it by the start, it selects all positions within the ranges, in accordance with
as.integer(ir)
. It's essentially a way to compress integer subscripts and has proven useful. Btw, many years ago, my original implementation did behave how you expected, but that behavior was replaced bysubsetByOverlaps()
. But do note that even callingsubsetByOverlaps(granges, iranges)
will not work, since the IRanges has no notion of sequence name.subsetByOverlaps()
actually does not cut it for me, because I need to assign metadata to a subset of the originalGRange
ranges, so I have tofindOverlaps()
and index with the result. Just out of curiosity, can you give an example where the present implementation might be useful?Hi,
subsetByOverlaps()
presents an interface similar tofindOverlaps()
in order to let you control the type of overlap that you want. We would not be able to do that if[
was used instead ofsubsetByOverlaps()
.The present implementation is useful in any situation where your subscript describes long blocks of adjacent elements in the object to subset. In that case representing the subscript as an IRanges object is more efficient than representing it as an integer vector:
Also when you use
ir
to subset an object, the code that performs the subsetting is written in C and can take advantage of the efficiency ofmemcpy()
. This is used in many places behind the scene like for example when you split an object by an Rle. It can make a significant difference if the object to split is big e.g. if you split a GRanges objectx
with 50 millions elements byseqnames(x)
.FYI these days you can even subset a named List by a GRanges object (a feature that was requested a couple of years ago). This doesn't involve any notion of overlap either:
More generally speaking my feeling is that subsetting (with
[
) is a low-level operation and it seems wrong to make it involve a high level operation like finding overlaps.Cheers,
H.