slow bplapply?
Wu, Hao

I recently noticed that bplapply seems to work differently from the older version.

For example, suppose I want to run a large number of lm fits:

library(BiocParallel)

# simulate some data for 10000 regressions
N = 10000
nobs = 100
X = matrix( rnorm(nobs*3), ncol=3 )
Y = matrix(rnorm(N*nobs), nrow=N)
numCores = 4

# a function to loop through the regressions
foo <- function(i) {
    lm(Y[i,]~X)$coef[2]
}

# bplapply is slow; it takes over 6 seconds on my laptop (MacBook Pro)
mParam = MulticoreParam(workers=numCores, progressbar=FALSE)
t0 = Sys.time()
beta = bplapply(1:N, foo, BPPARAM = mParam)
t1 = Sys.time()
difftime(t1, t0, unit="secs") 

# mclapply takes about 1 second
library(doParallel)  # attaches 'parallel', which provides mclapply()
t0 = Sys.time()
beta = mclapply(1:N, foo, mc.cores = numCores)
t1 = Sys.time()
difftime(t1, t0, unit="secs")

I'm not exactly sure what's going on, since bplapply worked perfectly fine for this before. I read the manual, and it seems bplapply now passes each element (each row index of Y) to the workers one at a time, instead of dividing all the data into a few chunks. It seems bpvec can achieve that, but it requires extra programming. Can someone provide a good solution, or should I just switch to mclapply?
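For concreteness, the kind of extra programming I mean would look something like this sketch (chunks and fooChunk are just illustrative names):

# A minimal sketch of manual chunking: dispatch one job per worker
# instead of one job per regression.
chunks <- split(1:N, cut(1:N, numCores, labels = FALSE))
fooChunk <- function(idx) vapply(idx, foo, numeric(1))
beta <- unlist(bplapply(chunks, fooChunk, BPPARAM = mParam), use.names = FALSE)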


> sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] doParallel_1.0.16   iterators_1.0.13    foreach_1.5.1      
[4] BiocParallel_1.28.0

loaded via a namespace (and not attached):
[1] compiler_4.1.2   codetools_0.2-18

I tried the bpvec function a little bit, but got a mysterious error:

N = 10000
nobs = 100
X = matrix( rnorm(nobs*3), ncol=3 )
Y = matrix(rnorm(N*nobs), nrow=N)
numCores = 4

library(BiocParallel)
mParam = MulticoreParam(workers=numCores, progressbar=FALSE)
foo <- function(idx) {
    res <- rep(0, length(idx))
    for(i in idx)
        res[i] <- lm(Y[i,]~X)$coef[2]
    res
}

t0 = Sys.time()
beta = bpvec(1:N, foo, BPPARAM = mParam)
t1 = Sys.time()
difftime(t1, t0, unit="secs")

It tells me "Error: length(FUN(X)) not equal to length(X)". This is strange, since my function foo returns a vector of the same length as the input.

I found that there is a tasks parameter in MulticoreParam that can evenly divide and distribute the jobs. However, it's not making it faster.

library(BiocParallel)

# simulate some data for 10000 regressions                                                                
N = 10000
nobs = 100
X = matrix( rnorm(nobs*3), ncol=3 )
Y = matrix(rnorm(N*nobs), nrow=N)
numCores = 4

# a function to loop through the regressions                                                              
foo <- function(i) {
    lm(Y[i,]~X)$coef[2]
}

# bplapply is still slow
mParam = MulticoreParam(workers=numCores, tasks = 0)
t0 = Sys.time()
beta = bplapply(1:N, foo, BPPARAM = mParam)
t1 = Sys.time()
difftime(t1, t0, unit="secs")

I found that using SnowParam with FORK works well:

snowFORK <- SnowParam(workers = numCores, type = "FORK")
t0 = Sys.time()
beta = bplapply(1:N, foo, BPPARAM = snowFORK)
t1 = Sys.time()
difftime(t1, t0, unit="secs")

So what's going on with MulticoreParam?

If I remember correctly, there was a change in how the R environment is loaded in the background jobs. You should probably read the BiocParallel NEWS; it might be related to:

o (v 1.31.13 / v 1.30.4) only export variables in .GlobalEnv or package:
https://github.com/Bioconductor/BiocParallel/issues/223
https://github.com/Bioconductor/BiocParallel/pull/224

But you might have more luck getting an answer on the bioc-devel mailing list or the Bioconductor Slack.
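If that is the culprit, one easy workaround (a sketch, not tested against this exact setup) is to pass the data to the worker function explicitly through bplapply's ... arguments rather than relying on the global environment:

# Pass Y and X to the worker function explicitly (here as yy and xx, to
# avoid clashing with bplapply's own X argument) instead of relying on
# them being exported from the global environment.
beta <- bplapply(1:N, function(i, yy, xx) lm(yy[i, ] ~ xx)$coef[2],
                 yy = Y, xx = X, BPPARAM = mParam)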

This is a post from 13 months ago; sorry it slipped through at the time. Here are current timings (in seconds) for me:

9.275732 SerialParam
8.231515 MulticoreParam
2.275638 SnowParam, type = "FORK"
2.154827 mclapply

The default tasks setting already divides the work in the best way for this computation (and the strategy is the same across each of the *Param classes).

bpvec() is not being used correctly. In a simplified example, if X = 1:10 were distributed over two workers, the first worker might receive 1:5 and the second worker 6:10. foo(1:5) is OK; length(foo(1:5)) == length(1:5). foo(6:10) is problematic because, as written, length(foo(6:10)) is 10 and not 5. An updated version is:

foo <- function(idx) {
    res <- rep(0, length(idx))
    for(i in seq_along(idx))
        res[i] <- lm(Y[idx[i],]~X)$coef[2]
    res
}

This takes about 2.24 seconds when run with MulticoreParam(). bpvec() would be even more useful if the implementation of foo() itself were 'vectorized', as in the trivial example on the help page ?bpvec, where the body is sqrt(idx) rather than an iteration as here.
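As a point of reference, that fully vectorized case is just (a minimal sketch along the lines of the ?bpvec example):

# sqrt() is already vectorized, so each worker receives one contiguous
# chunk of the input and the per-chunk results are concatenated.
res <- bpvec(1:N, sqrt, BPPARAM = MulticoreParam(4))
all.equal(res, sqrt(1:N))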

This catches us up to the question -- why is MulticoreParam being slow? In some respects it is implemented differently from mclapply() and so performance is not expected to be identical, but I'm surprised that SnowParam(type = "FORK") is relatively fast; I will look into this...

UPDATE: The reason for the slowness is that MulticoreParam() includes a 'garbage collection' parameter, force.GC, which is set to TRUE by default. Setting this to FALSE recovers the performance of mclapply().

system.time({
    beta <- bplapply(1:N, foo, BPPARAM = MulticoreParam(4, force.GC = FALSE))
})
##   user  system elapsed 
##  9.330   0.180   2.455

There is a mention of this in the NEWS file (pretty hard to follow grammatically, and not mentioning the performance trade-off):

CHANGES IN VERSION 1.28
-----------------------

USER VISIBLE CHANGES
...
    o (v. 1.27.9) By defualt, do _not_ only run garbage collection
    after every call to FUN(), except under MulticoreParam(). R's
    garbage collection algorithm only fails to do well when forked
    processes (i.e., MulticoreParam) assume that they are the only
    consumers of process memory.

and in the commit log. As mentioned in the commit, the motivation was to manage memory better in response to this pull request. I will rethink the approach and have opened an issue to track this.

Returning to the original task of extracting coefficients from repeatedly fitting the same model matrix, note that it is much more efficient to use lm.fit() than lm(); I have:

bpvec_lm <- function(idx) {
    ## a cleaner version of the updated foo() above; vapply() avoids
    ## explicitly pre-allocating the result vector
    vapply(idx, function(i) {
        lm(Y[i,] ~ X)$coefficients[2]
    }, numeric(1))
}

bpvec_lmfit <- function(idx) {
    x <- cbind(1, X) # model matrix computed once instead of N times
    vapply(idx, function(i) {
        lm.fit(x, Y[i,])$coefficients[2]
    }, numeric(1))
}

system.time(res1 <- bpvec_lm(1:N))
##    user  system elapsed 
##   8.449   0.188   8.659 
system.time(res2 <- bpvec_lmfit(1:N))
##   user  system elapsed 
##   0.742   0.052   0.800 
system.time(res3 <- bpvec(1:N, bpvec_lmfit, BPPARAM = MulticoreParam(4)))
##    user  system elapsed 
##   0.826   0.093   0.274 
identical(res1, res2)
## TRUE
identical(res1, res3)
## TRUE

For a 10x speedup WITHOUT parallel evaluation.

Thanks for the answers. I'll look into this.

For the lm example: I was not trying to do lm; I just used it as a dummy example to try out the parallel functions. Of course, if I really wanted to do many lm fits I would use solve(t(X)%*%X) %*% (t(X)%*%Y), which would be 100 times faster.
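For concreteness, here is roughly what I mean for the simulated data above (a sketch only; the lm() calls include an intercept, so the design matrix needs a column of 1s, and Y has to be transposed so that each column is one response):

x <- cbind(1, X)                                    # nobs x 4 design matrix, with intercept
betaAll <- solve(crossprod(x), crossprod(x, t(Y)))  # 4 x N matrix of coefficients
beta <- betaAll[2, ]                                # same quantity as lm(Y[i,] ~ X)$coef[2]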
