Hello,
Suppose we have a simple task, that is, to select the fifth range from each GRanges object from a GRangesList object. I can think of two methods:
library(EnsDb.Hsapiens.v86)
library(microbenchmark)
X <- exonsBy(EnsDb.Hsapiens.v86, by = "tx")
X <- X[lengths(X) > 10] # to make sure every `GRanges` object has at least 11 elements;
X <- X[1:100] # to make a smaller dataset
# Method 1
> microbenchmark(res <- endoapply(X, FUN = "[", 5))
Unit: milliseconds
expr min lq mean median uq
res <- endoapply(X, FUN = "[", 5) 726.677 841.4338 879.6332 872.5642 905.1298
max neval
1755.904 100
# Method 2
> microbenchmark(res <- lapply(X, FUN = "[", 5))
Unit: milliseconds
expr min lq mean median uq
res <- lapply(X, FUN = "[", 5) 594.1233 638.44 710.8889 682.4832 739.132
max neval
1616.825 100
# `res` is of `list` class, so it needs to be converted into `GRangesList` as `endoapply` does.
> microbenchmark(as(res, "GRangesList"))
Unit: milliseconds
expr min lq mean median uq max
as(res, "GRangesList") 124.2346 137.1553 187.5402 152.5325 252.8355 353.438
neval
100
The benchmark test shows that both methods have comparable speed, and neither is fast.
I'm wondering if there is any faster way to do it, besides using multi-core parallel computing. Thanks.
---------------------------------------------------------------------------------------------------------------------------------------------------
Update1. Thanks to @Martin Morgan's suggestion; it makes things 100 times faster.
> microbenchmark(as(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], "GRangesList"))
Unit: milliseconds
expr min
as(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], "GRangesList") 4.518822
lq mean median uq max neval
5.559363 7.725312 6.223389 7.224878 100.3304 100
Update 2. According to @Martin Morgan's second suggestion, splitAsList goes even faster than as(..., "GRangesList").
> microbenchmark(splitAsList(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], 1:100))
Unit: milliseconds
expr
splitAsList(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], 1:100)
min lq mean median uq max neval
4.748592 4.852561 6.094437 4.961308 5.099946 105.5112 100
Update 3. Thanks to @Michael Lawrence's answer too. Indeed it takes about twice the amount of time.
> microbenchmark(res <- as(X[rep(IntegerList(5L), length(X))], "GRangesList"))
Unit: milliseconds
expr min lq
res <- as(X[rep(IntegerList(5L), length(X))], "GRangesList") 10.4255 11.49089
mean median uq max neval
13.41262 12.74597 14.06349 25.7567 100
If we break down the process,
> microbenchmark(i <- rep(IntegerList(5L), length(X)))
Unit: milliseconds
expr min lq mean median
i <- rep(IntegerList(5L), length(X)) 5.770752 7.612974 9.61129 8.744468
uq max neval
10.86042 38.87412 100
> microbenchmark(res <- X[i])
Unit: milliseconds
expr min lq mean median uq max neval
res <- X[i] 5.101723 5.827044 6.797364 6.401072 7.196423 14.13441 100
> microbenchmark(as(res, "GRangesList"))
Unit: microseconds
expr min lq mean median uq max neval
as(res, "GRangesList") 2.064 2.1705 2.39714 2.24 2.313 16.42 100
we can see that constructing a CompressedIntegerList object is the main time consuming step, while subsetting takes similar amount of time as @Martin Morgan's method.
In summary, @Martin Morgan's solution is faster, while @Michael Lawrence's seems more versatile. Both are much more efficient than the original methods, thanks.
