Hello,
Suppose we have a simple task, that is, to select the fifth range from each GRanges
object from a GRangesList
object. I can think of two methods:
library(EnsDb.Hsapiens.v86) library(microbenchmark) X <- exonsBy(EnsDb.Hsapiens.v86, by = "tx") X <- X[lengths(X) > 10] # to make sure every `GRanges` object has at least 11 elements; X <- X[1:100] # to make a smaller dataset # Method 1 > microbenchmark(res <- endoapply(X, FUN = "[", 5)) Unit: milliseconds expr min lq mean median uq res <- endoapply(X, FUN = "[", 5) 726.677 841.4338 879.6332 872.5642 905.1298 max neval 1755.904 100
# Method 2 > microbenchmark(res <- lapply(X, FUN = "[", 5)) Unit: milliseconds expr min lq mean median uq res <- lapply(X, FUN = "[", 5) 594.1233 638.44 710.8889 682.4832 739.132 max neval 1616.825 100 # `res` is of `list` class, so it needs to be converted into `GRangesList` as `endoapply` does. > microbenchmark(as(res, "GRangesList")) Unit: milliseconds expr min lq mean median uq max as(res, "GRangesList") 124.2346 137.1553 187.5402 152.5325 252.8355 353.438 neval 100
The benchmark test shows that both methods have comparable speed, and neither is fast.
I'm wondering if there is any faster way to do it, besides using multi-core parallel computing. Thanks.
---------------------------------------------------------------------------------------------------------------------------------------------------
Update1. Thanks to @Martin Morgan's suggestion; it makes things 100 times faster.
> microbenchmark(as(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], "GRangesList")) Unit: milliseconds expr min as(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], "GRangesList") 4.518822 lq mean median uq max neval 5.559363 7.725312 6.223389 7.224878 100.3304 100
Update 2. According to @Martin Morgan's second suggestion, splitAsList
goes even faster than as(..., "GRangesList")
.
> microbenchmark(splitAsList(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], 1:100)) Unit: milliseconds expr splitAsList(unlist(X)[cumsum(c(0, head(lengths(X), -1))) + 5], 1:100) min lq mean median uq max neval 4.748592 4.852561 6.094437 4.961308 5.099946 105.5112 100
Update 3. Thanks to @Michael Lawrence's answer too. Indeed it takes about twice the amount of time.
> microbenchmark(res <- as(X[rep(IntegerList(5L), length(X))], "GRangesList")) Unit: milliseconds expr min lq res <- as(X[rep(IntegerList(5L), length(X))], "GRangesList") 10.4255 11.49089 mean median uq max neval 13.41262 12.74597 14.06349 25.7567 100
If we break down the process,
> microbenchmark(i <- rep(IntegerList(5L), length(X))) Unit: milliseconds expr min lq mean median i <- rep(IntegerList(5L), length(X)) 5.770752 7.612974 9.61129 8.744468 uq max neval 10.86042 38.87412 100 > microbenchmark(res <- X[i]) Unit: milliseconds expr min lq mean median uq max neval res <- X[i] 5.101723 5.827044 6.797364 6.401072 7.196423 14.13441 100 > microbenchmark(as(res, "GRangesList")) Unit: microseconds expr min lq mean median uq max neval as(res, "GRangesList") 2.064 2.1705 2.39714 2.24 2.313 16.42 100
we can see that constructing a CompressedIntegerList object is the main time consuming step, while subsetting takes similar amount of time as @Martin Morgan's method.
In summary, @Martin Morgan's solution is faster, while @Michael Lawrence's seems more versatile. Both are much more efficient than the original methods, thanks.