Speed efficient way to select the n-th element from GRangsList
2
1
Entering edit mode
foehn ▴ 100
@foehn-16281
Last seen 2.7 years ago
United States
Hello,

Suppose we have a simple task, that is, to select the fifth range from each GRanges object from a GRangesList object. I can think of two methods:

library(EnsDb.Hsapiens.v86)
library(microbenchmark)
X <- exonsBy(EnsDb.Hsapiens.v86, by = "tx")
X <- X[lengths(X) > 10] # to make sure every `GRanges` object has at least 11 elements;
X <- X[1:100] # to make a smaller dataset

# Method 1
> microbenchmark(res <- endoapply(X, FUN = "[", 5))
Unit: milliseconds
                              expr     min       lq     mean   median       uq
 res <- endoapply(X, FUN = "[", 5) 726.677 841.4338 879.6332 872.5642 905.1298
      max neval
 1755.904   100

 

# Method 2
> microbenchmark(res <- lapply(X, FUN = "[", 5))
Unit: milliseconds
                           expr      min     lq     mean   median      uq
 res <- lapply(X, FUN = "[", 5) 594.1233 638.44 710.8889 682.4832 739.132
      max neval
 1616.825   100
# `res` is of `list` class, so it needs to be converted into `GRangesList` as `endoapply` does.  
> microbenchmark(as(res, "GRangesList"))
Unit: milliseconds
                   expr      min       lq     mean   median       uq     max
 as(res, "GRangesList") 124.2346 137.1553 187.5402 152.5325 252.8355 353.438
 neval
   100

The benchmark test shows that both methods have comparable speed, and neither is fast. 

I'm wondering if there is any faster way to do it, besides using multi-core parallel computing.  Thanks.

 

 

---------------------------------------------------------------------------------------------------------------------------------------------------

Update1. Thanks to @Martin Morgan's suggestion; it makes things 100 times faster.

> microbenchmark(as(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], "GRangesList"))
Unit: milliseconds
                                                                 expr      min
 as(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], "GRangesList") 4.518822
       lq     mean   median       uq      max neval
 5.559363 7.725312 6.223389 7.224878 100.3304   100

Update 2. According to @Martin Morgan's second suggestion, splitAsList goes even faster than as(..., "GRangesList").

> microbenchmark(splitAsList(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], 1:100))
Unit: milliseconds
                                                                       expr
 splitAsList(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5],      1:100)
      min       lq     mean   median       uq      max neval
 4.748592 4.852561 6.094437 4.961308 5.099946 105.5112   100

Update 3. Thanks to @Michael Lawrence's answer too. Indeed it takes about twice the amount of time.

> microbenchmark(res <- as(X[rep(IntegerList(5L), length(X))], "GRangesList"))
Unit: milliseconds
                                                         expr     min       lq
 res <- as(X[rep(IntegerList(5L), length(X))], "GRangesList") 10.4255 11.49089
     mean   median       uq     max neval
 13.41262 12.74597 14.06349 25.7567   100

If we break down the process,

> microbenchmark(i <- rep(IntegerList(5L), length(X)))
Unit: milliseconds
                                 expr      min       lq    mean   median
 i <- rep(IntegerList(5L), length(X)) 5.770752 7.612974 9.61129 8.744468
       uq      max neval
 10.86042 38.87412   100
> microbenchmark(res <- X[i])
Unit: milliseconds
        expr      min       lq     mean   median       uq      max neval
 res <- X[i] 5.101723 5.827044 6.797364 6.401072 7.196423 14.13441   100
> microbenchmark(as(res, "GRangesList"))
Unit: microseconds
                   expr   min     lq    mean median    uq   max neval
 as(res, "GRangesList") 2.064 2.1705 2.39714   2.24 2.313 16.42   100

we can see that constructing a CompressedIntegerList object is the main time consuming step, while subsetting takes similar amount of time as @Martin Morgan's method.

In summary, @Martin Morgan's solution is faster, while @Michael Lawrence's seems more versatile. Both are much more efficient than the original methods, thanks. 

 

R GenomicRanges IRanges • 1.1k views
ADD COMMENT
2
Entering edit mode
@michael-lawrence-3846
Last seen 2.4 years ago
United States

You could also use List-based subsetting.

X[rep(IntegerList(5L), length(X))]

That takes about twice the time as Martin's example but one could argue that is more comprehensible once one understands how to subset by a List.

Also, instead of splitAsList(), you could just coerce the result to a GRangesList. It takes the same amount of time.

as(unlist(X)[offset], "GRangesList")

 

ADD COMMENT
1
Entering edit mode
@martin-morgan-1513
Last seen 7 days ago
United States

It's fast to unlist a GRangesList, so figure out the offsets of interest

offset = cumsum(head(c(0, lengths(X)), -1)) + 5

and subset the unlisted elements

unlist(X)[offset]

For GRangesList (but why?) use splitAsList(unlist(X)[offset], seq_along(offset)).

ADD COMMENT

Login before adding your answer.

Traffic: 572 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6