Efficiently construct GRanges/IRanges from Rle vector
1
0
Entering edit mode
my4 ▴ 10
@my4-10129
Last seen 4.9 years ago

I have a Run length encoded vector representing some value at every position on the genome, in order. As a toy example suppose I had just one chromosome of length 10, then I would have a vector looking like

library(GenomicRanges)

set.seed(1)
toyData = Rle(sample(1:3,10,replace=TRUE))

I would like to coerce this into a GRanges object. The best I can come up with is

gr = GRanges('toyChr',IRanges(start(toyData)-1,
width=runLength(toyData)),
toyData = runValue(toyData))

which works, but is quite slow. Is there a faster way to construct the same object?

Rle granges iranges • 909 views
0
Entering edit mode

For a chromosome length object, when I run this

topData = Rle(sample(1:3,1e8,replace=TRUE))

microbenchmark(GRanges('toyChr',IRanges(start(toyData)-1, width=runLength(toyData), toyData = runValue(toyData)), times = 3)

It takes roughly 15 seconds to convert

Unit: seconds
expr
GRanges("toyChr", IRanges(start(toyData) - 1, width = runLength(toyData)),      toyData = runValue(toyData))
min       lq     mean  median       uq      max neval
14.8322 15.21285 15.58755 15.5935 15.96522 16.33694     3

0
Entering edit mode

0
Entering edit mode
@michael-lawrence-3846
Last seen 10 months ago
United States

On my machine, in devel, that example takes about 3 seconds. 1/3 of the time is spent making the IRanges, and the rest is basically in the validity check.

Btw, I don't think you should subtract one from the start. IRanges is 1-based.

Calling as(x, "GRanges") is the canonical way to convert an RleList (multiple chromosomes) to a GRanges where each range represents a run in the Rle. I just rewrote it in devel so that it's a bit faster, but it's still going to be slow due to the overhead of lists. Would be nice to be able to skip some of the validity checking internally. What do people think?