Load balancing using BiocParallel
t.kuilman ▴ 170
@tkuilman-6868
Netherlands

I have been using BiocParallel for a while (with sockets and snow; please see below) and have noticed that, at least with these settings, there seems to be no load balancing: the work appears to be distributed only once, right at the beginning. As a result, workers that are faster (or have fewer calculations to do) sometimes sit idle while jobs are still queued for other, slower workers. Are there settings in BiocParallel that would allow load balancing, for instance by switching to a SerialParam / MulticoreParam / BatchJobsParam / DoparParam instance, or just by changing the settings for SnowParam?

Thank you very much,

Thomas Kuilman

library(BiocParallel)
bp.param <- SnowParam(workers = 32, type = "SOCK")
bplapply(x, FUN, BPPARAM = bp.param)

 

 

@martin-morgan-1513
United States

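Use the 'tasks' argument to the param object. The following uses the default, tasks=0, which divides x as evenly as possible into one chunk per worker and sends each chunk once, at the start: 10s of work (5s + 5s) goes to worker 1 and 2s (1s + 1s) to worker 2, so the call takes about as long as the slower worker.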

> library(BiocParallel)
> x = c(5, 5, 1, 1)
> system.time(bplapply(x, Sys.sleep, BPPARAM=SnowParam(2)))
starting worker localhost:11683
starting worker localhost:11683
   user  system elapsed 
  0.033   0.006  11.012 

On the other hand, tasks=4 sends one task to each worker and then, whenever a worker returns, sends it the next one, so probably 5s + 1s go to worker 1 and 5s + 1s to worker 2.

> system.time(bplapply(x, Sys.sleep, BPPARAM=SnowParam(2, tasks=4)))
starting worker localhost:11683
starting worker localhost:11683
   user  system elapsed 
  0.013   0.007   7.057 
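
To see where the roughly 10s and 6s of sleeping in the two timings come from, here is a small base-R simulation. It is only a sketch and does not call BiocParallel: the helper names `static_makespan()` and `dynamic_makespan()` are made up for illustration, and the contiguous chunking is an assumption about how the default split behaves, not BiocParallel's actual implementation. The measured timings above additionally include worker start-up and communication overhead.

```r
# Static scheduling: cut x into contiguous, near-equal chunks, one per worker.
# (Assumed to mimic the default tasks = 0 behaviour described above.)
static_makespan <- function(x, workers) {
  chunk <- ceiling(seq_along(x) * workers / length(x))
  max(tapply(x, chunk, sum))
}

# Dynamic scheduling: each element is its own task and goes to whichever
# worker becomes free first (as with tasks = length(x)).
dynamic_makespan <- function(x, workers) {
  busy_until <- numeric(workers)
  for (duration in x) {
    w <- which.min(busy_until)            # first worker to become idle
    busy_until[w] <- busy_until[w] + duration
  }
  max(busy_until)
}

x <- c(5, 5, 1, 1)
static_makespan(x, 2)   # 10: worker 1 gets 5 + 5, worker 2 gets 1 + 1
dynamic_makespan(x, 2)  # 6: each worker ends up with 5 + 1
```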

It's worth noting that the workers are re-used, not restarted, for each task. A large number of workers probably implies a single machine with multiple cores, in which case MulticoreParam() is probably a better choice.

 


Thank you very much for the crisp and illuminating example; I had missed the tasks parameter, but that is obviously the one I need to set. Just out of curiosity: you mention that it would be better to use MulticoreParam() on a single machine with multiple cores. Why would that be better, and would it provide load balancing out of the box?


MulticoreParam() does not load balance automatically: more tasks imply more communication and data transfer between the manager and the workers, so finer-grained scheduling is potentially more expensive in terms of time.

Each MulticoreParam() worker is a 'fork' of the original process. Each fork shares memory with the original process, so for instance it is not necessary to load packages on the workers. In principle the fork is also memory efficient, since memory is only copied when the fork changes it. However, R's garbage collector touches almost all R objects, likely triggering a copy in each fork; this was pointed out here.
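
Since MulticoreParam() does not balance load on its own, the tasks argument from the answer above can be passed to it as well. The following is a minimal sketch under stated assumptions: the short sleep durations, the Windows fallback to SnowParam(), and the use of Sys.getpid() to identify the worker are illustrative choices, not part of the original answer.

```r
library(BiocParallel)

x <- c(0.5, 0.5, 0.1, 0.1)  # seconds of simulated work per element

# Forked workers are not available on Windows, so fall back to SnowParam there.
make_param <- if (.Platform$OS.type == "windows") SnowParam else MulticoreParam

# tasks = length(x): one element per task, handed out as workers become free.
param <- make_param(workers = 2, tasks = length(x))

# Each task sleeps, then reports the process id of the worker that ran it.
res <- bplapply(x, function(s) { Sys.sleep(s); Sys.getpid() }, BPPARAM = param)

table(unlist(res))  # number of elements handled by each worker process
```

Setting tasks to length(x) gives the finest granularity; an intermediate value trades some balancing for less manager-worker communication, which is the cost mentioned above.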


Ok, that is clear. Thanks for your help once more!
