I have been using BiocParallel for a while (using sockets and snow; please see below) and have noticed that, at least with these settings, there seems to be no load balancing: the distribution of the work seems to occur only once, right at the beginning. As a result, workers that are faster (or have fewer calculations to do) are sometimes idle while jobs remain queued for other, slower workers. Are there settings of BiocParallel that would allow load balancing (for instance by changing to SerialParam / MulticoreParam / BatchJobsParam / DoparParam instances, or by just changing the settings for SnowParam)?
Thank you very much,
Thomas Kuilman
library(BiocParallel)
bp.param <- SnowParam(workers = 32, type = "SOCK")
bplapply(x, FUN, BPPARAM = bp.param)
Thank you very much for the crisp and illuminating example you provided; I missed the tasks parameter, but that is obviously the one I need to set. Just out of curiosity: you mention that it would be better to use MulticoreParam() on a single machine with multiple cores; why would that be better, and would it provide load balancing out of the box?
MulticoreParam() does not load balance automatically (more tasks imply more communication / data transfer between the manager and workers, so load balancing is potentially more expensive in terms of time).
MulticoreParam() workers are 'forks' of the original process. Each fork shares memory with the original process, so for instance it is not necessary to load packages on the workers. In principle a fork is also memory efficient, since memory is only copied when the fork changes it. However, R's garbage collector touches almost all R objects, likely triggering a copy in each fork; this was pointed out here.
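For readers following along, here is a minimal sketch of the tasks setting discussed in this thread. The workload (x and FUN below) is made up purely for illustration, and the worker count is reduced to 2 so it runs on a small machine. With the default setting the input is split into one chunk per worker up front; setting tasks to length(x) makes each element its own task, so a worker that finishes early can pick up the next one, at the cost of more manager/worker communication.

```r
library(BiocParallel)

# Hypothetical workload: one slow element followed by several fast ones.
x <- c(1, rep(0.1, 7))
FUN <- function(secs) {
    Sys.sleep(secs)  # stand-in for a real computation of uneven duration
    secs
}

# Default: work is divided among the workers once, at the start.
static <- SnowParam(workers = 2, type = "SOCK")

# One task per element: tasks are handed out as workers become free.
dynamic <- SnowParam(workers = 2, type = "SOCK", tasks = length(x))

res <- bplapply(x, FUN, BPPARAM = dynamic)
```

The same tasks argument is accepted by MulticoreParam(), so the comparison can be repeated with forked workers on a single Linux or macOS machine. Whether the finer-grained split pays off depends on how uneven the per-element run times are relative to the cost of sending each task.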
Ok, that is clear. Thanks for your help once more!