Question: Error: subsetting GenomicRanges with disjoint ranges produces errors
0
3.5 years ago by
Croatia
ripkrizbi0 wrote:

I'm wondering if this is a feature or a bug, but when I try to subset GRanges object using a disjoint IRanges, I get a funny error. Any idea on how to solve this? Thanks!

> ir <- IRanges(start=c(1:5, 7:9), end=c(1:5, 7:9)) > ir IRanges of length 8     start end width [1]     1   1     1 [2]     2   2     1 [3]     3   3     1 [4]     4   4     1 [5]     5   5     1 [6]     7   7     1 [7]     8   8     1 [8]     9   9     1 > gr <- GRanges(seqnames="dummy", ranges=ir) > gr GRanges object with 8 ranges and 0 metadata columns:       seqnames    ranges strand          <Rle> <IRanges>  <Rle>   [1]    dummy    [1, 1]      *   [2]    dummy    [2, 2]      *   [3]    dummy    [3, 3]      *   [4]    dummy    [4, 4]      *   [5]    dummy    [5, 5]      *   [6]    dummy    [7, 7]      *   [7]    dummy    [8, 8]      *   [8]    dummy    [9, 9]      *   -------   seqinfo: 1 sequence from an unspecified genome; no seqlengths > gr[ir] Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :   subscript contains out-of-bounds ranges

Even more strange result (this is clearly wrong):

> head(ir)
IRanges of length 6
start end width
[1]     1   1     1
[2]     2   2     1
[3]     3   3     1
[4]     4   4     1
[5]     5   5     1
[6]     7   7     1

GRanges object with 6 ranges and 0 metadata columns:
seqnames    ranges strand
<Rle> <IRanges>  <Rle>
[1]    dummy    [1, 1]      *
[2]    dummy    [2, 2]      *
[3]    dummy    [3, 3]      *
[4]    dummy    [4, 4]      *
[5]    dummy    [5, 5]      *
[6]    dummy    [8, 8]      *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths



> sessionInfo() R version 3.2.2 Patched (2015-11-06 r69615) Platform: x86_64-suse-linux-gnu (64-bit) Running under: openSUSE 13.2 (Harlequin) (x86_64)

locale:  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C                [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8      [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                   [9] LC_ADDRESS=C               LC_TELEPHONE=C             [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages: [1] stats4    parallel  stats     graphics  grDevices utils     datasets [8] methods   base     

other attached packages: [1] GenomicRanges_1.22.1 GenomeInfoDb_1.6.1   IRanges_2.4.1        [4] S4Vectors_0.8.1      BiocGenerics_0.16.1  BiocInstaller_1.20.0

loaded via a namespace (and not attached): [1] zlibbioc_1.16.0 XVector_0.10.0  tools_3.2.2    

genomicranges software error • 1.1k views
modified 3.5 years ago • written 3.5 years ago by ripkrizbi0

Ok, thanks guys!

- this is counterintuitive indeed and perhaps a suggestion would be to somehow disallow it in order to avoid confusion. There is no point in accepting range objects as indices only to extract the start coordinates and use them as index. If someone wanted to do that (why use ranges then in the first place and not a vector?) better have them do it explicitly...

Or, if indexing is done by ranges, why not directly use subsetByOverlaps method?

K

It does not do it by the start, it selects all positions within the ranges, in accordance with as.integer(ir). It's essentially a way to compress integer subscripts and has proven useful. Btw, many years ago, my original implementation did behave how you expected, but that behavior was replaced by subsetByOverlaps(). But do note that even calling subsetByOverlaps(granges, iranges) will not work, since the IRanges has no notion of sequence name.

subsetByOverlaps() actually does not cut it for me, because I need to assign metadata to a subset of the original GRange ranges, so I have to findOverlaps() and index with the result. Just out of curiosity, can you give an example where the present implementation might be useful?

Hi,

subsetByOverlaps() presents an interface similar to findOverlaps() in order to let you control the type of overlap that you want. We would not be able to do that if [ was used instead of subsetByOverlaps().

The present implementation is useful in any situation where your subscript describes long blocks of adjacent elements in the object to subset. In that case representing the subscript as an IRanges object is more efficient than representing it as an integer vector:

library(IRanges)
set.seed(2009)
score <- Rle(sample(10, 5e7, replace=TRUE))
i <- score >= 3
object.size(i)
# 128067616 bytes
ir <- as(i, "IRanges")
object.size(ir)
# 64034736 bytes

Also when you use ir to subset an object, the code that performs the subsetting is written in C and can take advantage of the efficiency of memcpy(). This is used in many places behind the scene like for example when you split an object by an Rle. It can make a significant difference if the object to split is big e.g. if you split a GRanges object x with 50 millions elements by seqnames(x).

FYI these days you can even subset a named List by a GRanges object (a feature that was requested a couple of years ago). This doesn't involve any notion of overlap either:

library(GenomicRanges)
example(GRanges)
x <- RleList(chr1=101:120, chr2=2:-8, chr3=31:40)
x[gr]

More generally speaking my feeling is that subsetting (with [) is a low-level operation and it seems wrong to make it involve a high level operation like finding overlaps.

Cheers,

H.

Answer: Error: subsetting GenomicRanges with disjoint ranges produces errors
1
3.5 years ago by
United States
Michael Lawrence11k wrote:

It can be a little confusing, but the ranges in the IRanges object are treated as integer subscripts into the GRanges. There is no overlap computation. So the last range refers to position 9, while the GRanges is of length 8.

To emphasize: when i is a Ranges object, x[i] is semantically equivalent to x[as.integer(i)], If you want the subsetting to take into account the overlapping ranges between x and i, please use subsetByOverlaps(x, i) instead.

H.