Question

Error: subsetting GenomicRanges with disjoint ranges produces errors

0

Entering edit mode

ripkrizbi • 0

@ripkrizbi-9162

Last seen 10.0 years ago

Croatia

I'm wondering if this is a feature or a bug, but when I try to subset GRanges object using a disjoint IRanges, I get a funny error. Any idea on how to solve this? Thanks!

> ir <- IRanges(start=c(1:5, 7:9), end=c(1:5, 7:9)) > ir IRanges of length 8 start end width [1] 1 1 1 [2] 2 2 1 [3] 3 3 1 [4] 4 4 1 [5] 5 5 1 [6] 7 7 1 [7] 8 8 1 [8] 9 9 1 > gr <- GRanges(seqnames="dummy", ranges=ir) > gr GRanges object with 8 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] dummy [1, 1] * [2] dummy [2, 2] * [3] dummy [3, 3] * [4] dummy [4, 4] * [5] dummy [5, 5] * [6] dummy [7, 7] * [7] dummy [8, 8] * [8] dummy [9, 9] * ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths > gr[ir] Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) : subscript contains out-of-bounds ranges

Even more strange result (this is clearly wrong):

> head(ir)
IRanges of length 6
    start end width
[1]     1   1     1
[2]     2   2     1
[3]     3   3     1
[4]     4   4     1
[5]     5   5     1
[6]     7   7     1

> gr[head(ir)]
GRanges object with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]    dummy    [1, 1]      *
  [2]    dummy    [2, 2]      *
  [3]    dummy    [3, 3]      *
  [4]    dummy    [4, 4]      *
  [5]    dummy    [5, 5]      *
  [6]    dummy    [8, 8]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> sessionInfo() R version 3.2.2 Patched (2015-11-06 r69615) Platform: x86_64-suse-linux-gnu (64-bit) Running under: openSUSE 13.2 (Harlequin) (x86_64)

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] GenomicRanges_1.22.1 GenomeInfoDb_1.6.1 IRanges_2.4.1 [4] S4Vectors_0.8.1 BiocGenerics_0.16.1 BiocInstaller_1.20.0

loaded via a namespace (and not attached): [1] zlibbioc_1.16.0 XVector_0.10.0 tools_3.2.2

software error genomicranges • 2.8k views

ADD COMMENT • link 10.0 years ago ripkrizbi • 0

0

Entering edit mode

Ok, thanks guys!

- this is counterintuitive indeed and perhaps a suggestion would be to somehow disallow it in order to avoid confusion. There is no point in accepting range objects as indices only to extract the start coordinates and use them as index. If someone wanted to do that (why use ranges then in the first place and not a vector?) better have them do it explicitly...

Or, if indexing is done by ranges, why not directly use subsetByOverlaps method?

K

ADD REPLY • link 10.0 years ago ripkrizbi • 0

0

Entering edit mode

It does not do it by the start, it selects all positions within the ranges, in accordance with as.integer(ir). It's essentially a way to compress integer subscripts and has proven useful. Btw, many years ago, my original implementation did behave how you expected, but that behavior was replaced by subsetByOverlaps(). But do note that even calling subsetByOverlaps(granges, iranges) will not work, since the IRanges has no notion of sequence name.

ADD REPLY • link 10.0 years ago Michael Lawrence ★ 11k

0

Entering edit mode

subsetByOverlaps() actually does not cut it for me, because I need to assign metadata to a subset of the original GRange ranges, so I have to findOverlaps() and index with the result. Just out of curiosity, can you give an example where the present implementation might be useful?

ADD REPLY • link 10.0 years ago ripkrizbi • 0

0

Entering edit mode

Hi,

subsetByOverlaps() presents an interface similar to findOverlaps() in order to let you control the type of overlap that you want. We would not be able to do that if [ was used instead of subsetByOverlaps().

The present implementation is useful in any situation where your subscript describes long blocks of adjacent elements in the object to subset. In that case representing the subscript as an IRanges object is more efficient than representing it as an integer vector:

library(IRanges)
set.seed(2009)
score <- Rle(sample(10, 5e7, replace=TRUE))
i <- score >= 3
object.size(i)
# 128067616 bytes
ir <- as(i, "IRanges")
object.size(ir)
# 64034736 bytes

Also when you use ir to subset an object, the code that performs the subsetting is written in C and can take advantage of the efficiency of memcpy(). This is used in many places behind the scene like for example when you split an object by an Rle. It can make a significant difference if the object to split is big e.g. if you split a GRanges object x with 50 millions elements by seqnames(x).

FYI these days you can even subset a named List by a GRanges object (a feature that was requested a couple of years ago). This doesn't involve any notion of overlap either:

library(GenomicRanges)
example(GRanges)
x <- RleList(chr1=101:120, chr2=2:-8, chr3=31:40)
x[gr]

More generally speaking my feeling is that subsetting (with [) is a low-level operation and it seems wrong to make it involve a high level operation like finding overlaps.

Cheers,

H.

ADD REPLY • link 10.0 years ago Hervé Pagès 16k

score 1 · Accepted Answer · 2015-11-11

1

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 4.0 years ago

United States

It can be a little confusing, but the ranges in the IRanges object are treated as integer subscripts into the GRanges. There is no overlap computation. So the last range refers to position 9, while the GRanges is of length 8.

ADD COMMENT • link 10.0 years ago Michael Lawrence ★ 11k

0

Entering edit mode

To emphasize: when i is a Ranges object, x[i] is semantically equivalent to x[as.integer(i)], If you want the subsetting to take into account the overlapping ranges between x and i, please use subsetByOverlaps(x, i) instead.

H.

ADD REPLY • link 10.0 years ago Hervé Pagès 16k