Hello,
Suppose we have a simple task, that is, to select the fifth range from each GRanges object from a GRangesList object. I can think of two methods:
library(EnsDb.Hsapiens.v86)
library(microbenchmark)
X <- exonsBy(EnsDb.Hsapiens.v86, by = "tx")
X <- X[lengths(X) > 10] # to make sure every `GRanges` object has at least 11 elements;
X <- X[1:100] # to make a smaller dataset
# Method 1
> microbenchmark(res <- endoapply(X, FUN = "[", 5))
Unit: milliseconds
                              expr     min       lq     mean   median       uq
 res <- endoapply(X, FUN = "[", 5) 726.677 841.4338 879.6332 872.5642 905.1298
      max neval
 1755.904   100
# Method 2
> microbenchmark(res <- lapply(X, FUN = "[", 5))
Unit: milliseconds
                           expr      min     lq     mean   median      uq
 res <- lapply(X, FUN = "[", 5) 594.1233 638.44 710.8889 682.4832 739.132
      max neval
 1616.825   100
# `res` is of `list` class, so it needs to be converted into `GRangesList` as `endoapply` does.  
> microbenchmark(as(res, "GRangesList"))
Unit: milliseconds
                   expr      min       lq     mean   median       uq     max
 as(res, "GRangesList") 124.2346 137.1553 187.5402 152.5325 252.8355 353.438
 neval
   100
The benchmark test shows that both methods have comparable speed, and neither is fast.
I'm wondering if there is any faster way to do it, besides using multi-core parallel computing. Thanks.
---------------------------------------------------------------------------------------------------------------------------------------------------
Update1. Thanks to @Martin Morgan's suggestion; it makes things 100 times faster.
> microbenchmark(as(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], "GRangesList"))
Unit: milliseconds
                                                                 expr      min
 as(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], "GRangesList") 4.518822
       lq     mean   median       uq      max neval
 5.559363 7.725312 6.223389 7.224878 100.3304   100
Update 2. According to @Martin Morgan's second suggestion, splitAsList goes even faster than as(..., "GRangesList").
> microbenchmark(splitAsList(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], 1:100))
Unit: milliseconds
                                                                       expr
 splitAsList(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5],      1:100)
      min       lq     mean   median       uq      max neval
 4.748592 4.852561 6.094437 4.961308 5.099946 105.5112   100
Update 3. Thanks to @Michael Lawrence's answer too. Indeed it takes about twice the amount of time.
> microbenchmark(res <- as(X[rep(IntegerList(5L), length(X))], "GRangesList"))
Unit: milliseconds
                                                         expr     min       lq
 res <- as(X[rep(IntegerList(5L), length(X))], "GRangesList") 10.4255 11.49089
     mean   median       uq     max neval
 13.41262 12.74597 14.06349 25.7567   100
If we break down the process,
> microbenchmark(i <- rep(IntegerList(5L), length(X)))
Unit: milliseconds
                                 expr      min       lq    mean   median
 i <- rep(IntegerList(5L), length(X)) 5.770752 7.612974 9.61129 8.744468
       uq      max neval
 10.86042 38.87412   100
> microbenchmark(res <- X[i])
Unit: milliseconds
        expr      min       lq     mean   median       uq      max neval
 res <- X[i] 5.101723 5.827044 6.797364 6.401072 7.196423 14.13441   100
> microbenchmark(as(res, "GRangesList"))
Unit: microseconds
                   expr   min     lq    mean median    uq   max neval
 as(res, "GRangesList") 2.064 2.1705 2.39714   2.24 2.313 16.42   100
we can see that constructing a CompressedIntegerList object is the main time consuming step, while subsetting takes similar amount of time as @Martin Morgan's method.
In summary, @Martin Morgan's solution is faster, while @Michael Lawrence's seems more versatile. Both are much more efficient than the original methods, thanks.
