3.6 years ago by
t.kuilman140
Netherlands
t.kuilman140 wrote:

I have been using BiocParallel for a while (with a socket-based SnowParam; please see below) and have noticed that, at least with these settings, there seems to be no load balancing: the work is distributed only once, right at the beginning. As a result, workers that are faster (or have fewer calculations to do) are sometimes idle while jobs are still queued for other, slower workers. Are there settings in BiocParallel that would allow load balancing, for instance by switching to a SerialParam / MulticoreParam / BatchJobsParam / DoparParam instance, or just by changing the settings for SnowParam?

Thank you very much,

Thomas Kuilman

library(BiocParallel)
bp.param <- SnowParam(workers = 32, type = "SOCK")
bplapply(x, FUN, BPPARAM = bp.param)

3.6 years ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:

Use the 'tasks' argument to the param object. The following uses the default tasks=0, which splits the elements of x as evenly as possible among the workers, so 10s of work (5s + 5s) goes to worker 1 and 2s (1s + 1s) to worker 2.

> library(BiocParallel)
> x = c(5, 5, 1, 1)
> system.time(bplapply(x, Sys.sleep, BPPARAM=SnowParam(2)))
starting worker localhost:11683
starting worker localhost:11683
   user  system elapsed 
  0.033   0.006  11.012 

On the other hand, tasks=4 divides x into four tasks and sends one to each worker; when a worker returns, it is sent the next task, so probably 5s + 1s goes to worker 1 and 5s + 1s to worker 2.

> system.time(bplapply(x, Sys.sleep, BPPARAM=SnowParam(2, tasks=4)))
starting worker localhost:11683
starting worker localhost:11683
   user  system elapsed 
  0.013   0.007   7.057 
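
The arithmetic behind the two timings can be sketched in base R, without starting any workers. This is only an illustration of the two scheduling strategies, not BiocParallel's internal code: `splitIndices()` from the parallel package stands in for the even split, and the greedy loop is an assumed model of "send the next task to whichever worker returns first". The variable names are made up for the example.

```r
library(parallel)  # base R; used here only for splitIndices()

x <- c(5, 5, 1, 1)  # task durations in seconds, as in the example
workers <- 2

# Static split: contiguous, near-equal chunks handed out once, up front.
chunks <- splitIndices(length(x), workers)
static_load <- sapply(chunks, function(i) sum(x[i]))
static_elapsed <- max(static_load)   # slowest worker determines the total

# Dynamic dispatch: each task goes to whichever worker becomes free first.
free_at <- numeric(workers)          # time at which each worker is next idle
for (d in x) {
  w <- which.min(free_at)
  free_at[w] <- free_at[w] + d
}
dynamic_elapsed <- max(free_at)

static_load      # 10 2
static_elapsed   # 10
dynamic_elapsed  # 6
```

The measured 11s and 7s are each about a second above these idealised values, which is consistent with the time spent starting workers and communicating with them.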

It's worth noting that the workers are re-used, not restarted for each task. A large number of workers probably implies a single machine with multiple cores, in which case MulticoreParam() is probably a better choice.

Thank you very much for the crisp and illuminating example; I missed the tasks parameter, but that is obviously the one I need to set. Just out of curiosity: you mention that it would be better to use MulticoreParam() on a single machine with multiple cores. Why would that be better, and would it provide load balancing out of the box?

MulticoreParam() does not load balance automatically (more tasks imply more communication / data transfer between the manager and workers, so load balancing is potentially more expensive in terms of time).
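
A minimal sketch of asking MulticoreParam() for finer-grained tasks, assuming a Unix-like system (forking is not available on Windows). The short sleep durations, the `param` and `res` names, and the wrapper function are illustrative only; the tasks argument is the same one used with SnowParam().

```r
library(BiocParallel)

# Illustrative workload: uneven run times, scaled down to fractions of a second.
x <- c(0.5, 0.5, 0.1, 0.1)

# One task per element, so an idle worker is handed the next element
# instead of receiving a fixed share of x up front.
param <- MulticoreParam(workers = 2, tasks = length(x))

# Sleep for s seconds, then return s so the result can be checked.
res <- bplapply(x, function(s) { Sys.sleep(s); s }, BPPARAM = param)
```

Whether the extra manager/worker communication pays off depends on how uneven the per-element run times are relative to the cost of sending each task.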

MulticoreParam() workers are 'forks' of the original process. Each fork shares memory with the original process, so for instance it is not necessary to load packages on the workers. In principle the fork is also memory efficient, since memory in the fork is only copied when the fork changes it. However, R's garbage collector touches almost all R objects, likely triggering a copy in each fork; this was pointed out here.
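
As a rough illustration of the shared state, the sketch below defines an object only in the manager session and reads it from forked workers without sending it explicitly. It assumes a Unix-like system; `lookup` and `res` are hypothetical names chosen for the example.

```r
library(BiocParallel)

# Hypothetical object that exists only in the manager session.
lookup <- c(a = 1, b = 2, c = 3)

# Forked workers start as copies of the manager process, so `lookup`
# is already visible inside FUN on each worker.
res <- bplapply(names(lookup), function(k) lookup[[k]] * 10,
                BPPARAM = MulticoreParam(workers = 2))

unlist(res)  # 10 20 30
```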