Dear all,
I'm having some performance issues with BiocParallel::bplapply that I think are somewhat related to this old post:
BiocParallel::bplapply() performance issue
I have started a new post because I'm using a much newer version of BiocParallel here (1.11.2), but I will use the same example:
> library(parallel)
> library(BiocParallel)
> system.time(lapply(1:1e2, function(x) order(rnorm(n=1e3))))
   user  system elapsed
  0.020   0.001   0.022
> system.time(mclapply(1:1e2, function(x) order(rnorm(n=1e3)), mc.cores=1))
   user  system elapsed
  0.010   0.000   0.011
> system.time(BiocParallel::bplapply(1:1e2, function(x) order(rnorm(n=1e3)), BPPARAM = MulticoreParam(workers = 1)))
   user  system elapsed
  0.022   0.003   0.025
Although bplapply and mclapply have the same performance with one worker, if I increase the number of workers to 2, bplapply becomes much slower than mclapply. This is true independently of the number of `tasks`, and, as in the linked post, it seems to be related to which packages are loaded. Going back to the old post's example, I get:
> system.time(mclapply(1:1e2, function(x) order(rnorm(n=1e3)), mc.cores=2))
   user  system elapsed
  0.002   0.006   0.015
> system.time(BiocParallel::bplapply(1:1e2, function(x) order(rnorm(n=1e3)), BPPARAM = MulticoreParam(workers = 2)))
   user  system elapsed
  0.053   0.018   0.204
> library(SummarizedExperiment)
> library(matrixStats)
> library(magrittr)
> library(ggplot2)
> library(biomaRt)
> system.time(BiocParallel::bplapply(1:1e2, function(x) order(rnorm(n=1e3)), BPPARAM = MulticoreParam(workers = 2)))
   user  system elapsed
  0.047   0.014   1.005
> system.time(BiocParallel::bplapply(1:1e2, function(x) order(rnorm(n=1e3)), BPPARAM = MulticoreParam(workers = 2, tasks = 2)))
   user  system elapsed
  0.005   0.006   0.964
Note that the packages I attached here are those that I load in my vignette, where I first noticed the problem, but it appears that just loading SummarizedExperiment is enough to cause the same issue.
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] biomaRt_2.33.1             ggplot2_2.2.1
 [3] magrittr_1.5               scRNAseq_1.3.0
 [5] SummarizedExperiment_1.7.4 DelayedArray_0.3.6
 [7] matrixStats_0.52.2         Biobase_2.37.2
 [9] GenomicRanges_1.29.4       GenomeInfoDb_1.13.2
[11] IRanges_2.11.3             S4Vectors_0.15.3
[13] BiocGenerics_0.23.0        BiocParallel_1.11.2

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11            compiler_3.4.0          plyr_1.8.4
 [4] XVector_0.17.0          prettyunits_1.0.2       bitops_1.0-6
 [7] tools_3.4.0             zlibbioc_1.23.0         progress_1.1.2
[10] digest_0.6.12           RSQLite_1.1-2           memoise_1.1.0
[13] tibble_1.3.3            gtable_0.2.0            lattice_0.20-35
[16] rlang_0.1.1             Matrix_1.2-10           DBI_0.6-1
[19] GenomeInfoDbData_0.99.0 grid_3.4.0              R6_2.2.1
[22] AnnotationDbi_1.39.0    XML_3.98-1.7            scales_0.4.1
[25] assertthat_0.2.0        colorspace_1.3-2        RCurl_1.95-4.8
[28] lazyeval_0.2.0          munsell_0.4.3
I'll work further on this, noting that the original problem was much more severe than the one reported here.
If you are using parallel evaluation multiple times, the cost of establishing the cluster can be minimized by starting it once and reusing it across calls, e.g. as sketched below.
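A minimal sketch of that idea (my reconstruction; the exact calls were not preserved in the post), using BiocParallel's bpstart() and bpstop() to manage the worker pool explicitly:

    library(BiocParallel)

    ## create the back-end once and start its workers
    p <- MulticoreParam(workers = 2)
    bpstart(p)

    ## reuse the already-running workers across several calls
    res1 <- bplapply(1:100, function(x) order(rnorm(1e3)), BPPARAM = p)
    res2 <- bplapply(1:100, function(x) order(rnorm(1e3)), BPPARAM = p)

    ## shut the workers down when done
    bpstop(p)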
I don't really think that the use case implied by the test -- many very fast iterations -- is the right context for R-level parallel evaluation; just do the operation without the complexity of parallelization:

    order(rnorm(n = 1e3 * 1e2))

This is especially true for code in a package, where approximately half of our users will be on Windows and using independent processes; Windows users must necessarily pay the cost of starting separate processes. Also, R-level code (casting no aspersions!) can often be written to run two or more orders of magnitude faster by using vectorization rather than iteration; in new package submissions, my response when I see the use of parallel packages of any sort is to ask whether the code itself should be refactored, usually resulting in simpler, much faster, and more robust code. The usual steps are to 'hoist' constant sub-expressions out of loops, then hoist vectorizable sub-expressions out of the loop as pre-computed vectors.
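As a toy illustration of those two hoisting steps (my example, not from the original thread):

    ## naive loop: recomputes a constant and calls sqrt() element by element
    f_loop <- function(x) {
        out <- numeric(length(x))
        for (i in seq_along(x)) {
            k <- log(2)               # constant recomputed every iteration
            out[i] <- sqrt(x[i]) * k
        }
        out
    }

    ## refactored: constant hoisted out of the loop, sqrt() vectorized
    f_vec <- function(x) {
        k <- log(2)                   # hoisted constant sub-expression
        sqrt(x) * k                   # vectorized over the whole input
    }

    x <- rnorm(1e6)
    stopifnot(all.equal(f_loop(x), f_vec(x)))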
When the granularity of each task is larger, the overhead of parallel evaluation becomes unimportant.
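For instance (an illustrative sketch, not timings from this thread), with a few coarse tasks the per-worker startup cost is amortized:

    library(BiocParallel)

    ## a few coarse tasks, each doing substantial work
    heavy <- function(x) svd(matrix(rnorm(4e5), nrow = 1000))

    system.time(lapply(1:4, heavy))
    system.time(bplapply(1:4, heavy, BPPARAM = MulticoreParam(workers = 2)))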
Thanks Martin and Johannes! Both of your suggestions are appreciated!
I agree with Martin's point on vectorizing operations, but I came across this behavior and wanted to get your opinion on it.
Doesn't really answer your question, but since I also experienced problems with MulticoreParam on macOS... On Mac I switched from MulticoreParam to DoparParam, i.e. I'm using the doParallel package for parallel processing. I had the feeling that multicore/MulticoreParam had a problem with the forks, so I prefer pre-registering the number of processes first:
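A minimal sketch of that setup (the post's original code and sessionInfo were not preserved, so the exact calls are my assumption): doParallel registers the worker processes up front, and BiocParallel's DoparParam dispatches bplapply() to that back-end.

    library(doParallel)
    library(BiocParallel)

    ## pre-register a foreach back-end with two worker processes ...
    registerDoParallel(cores = 2)

    ## ... and make BiocParallel dispatch to it via DoparParam
    register(DoparParam(), default = TRUE)

    res <- bplapply(1:100, function(x) order(rnorm(1e3)))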