BiocParallel::bplapply() performance with MulticoreParam() is worse than mclapply()
Dear all,

I'm having some performance issues with BiocParallel::bplapply that I think are somewhat related to this old post:

BiocParallel::bplapply() performance issue

I have started a new post because I'm using a much newer version of BiocParallel here (1.11.2), but I will use the same example:

> library(parallel)
> library(BiocParallel)
> system.time(lapply(1:1e2,function(x) order(rnorm(n=1e3))))
   user  system elapsed
  0.020   0.001   0.022
> system.time(mclapply(1:1e2,function(x) order(rnorm(n=1e3)),mc.cores=1))
   user  system elapsed
  0.010   0.000   0.011
> system.time(BiocParallel::bplapply(1:1e2 , function(x) order(rnorm(n=1e3)), BPPARAM = MulticoreParam(workers = 1)))
   user  system elapsed
  0.022   0.003   0.025

Although bplapply and mclapply perform comparably with one worker, bplapply becomes much slower than mclapply once I increase the number of workers to 2. This holds regardless of the number of `tasks` and, as in the linked post, seems to depend on which packages are loaded. Going back to the example from the old post, I get:

> system.time(mclapply(1:1e2,function(x) order(rnorm(n=1e3)),mc.cores=2))
   user  system elapsed
  0.002   0.006   0.015
> system.time(BiocParallel::bplapply(1:1e2 , function(x) order(rnorm(n=1e3)), BPPARAM = MulticoreParam(workers = 2)))
   user  system elapsed
  0.053   0.018   0.204
> library(SummarizedExperiment)
> library(matrixStats)
> library(magrittr)
> library(ggplot2)
> library(biomaRt)
> system.time(BiocParallel::bplapply(1:1e2 , function(x) order(rnorm(n=1e3)), BPPARAM = MulticoreParam(workers = 2)))
   user  system elapsed
  0.047   0.014   1.005
> system.time(BiocParallel::bplapply(1:1e2 , function(x) order(rnorm(n=1e3)), BPPARAM = MulticoreParam(workers = 2, tasks = 2)))
   user  system elapsed 
  0.005   0.006   0.964

Note that the packages that I attached here are those that I load in my vignette, where I first noticed the problem, but it appears that just loading SummarizedExperiment will cause the same issue.

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] biomaRt_2.33.1             ggplot2_2.2.1             
 [3] magrittr_1.5               scRNAseq_1.3.0            
 [5] SummarizedExperiment_1.7.4 DelayedArray_0.3.6        
 [7] matrixStats_0.52.2         Biobase_2.37.2            
 [9] GenomicRanges_1.29.4       GenomeInfoDb_1.13.2       
[11] IRanges_2.11.3             S4Vectors_0.15.3          
[13] BiocGenerics_0.23.0        BiocParallel_1.11.2       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11            compiler_3.4.0          plyr_1.8.4             
 [4] XVector_0.17.0          prettyunits_1.0.2       bitops_1.0-6           
 [7] tools_3.4.0             zlibbioc_1.23.0         progress_1.1.2         
[10] digest_0.6.12           RSQLite_1.1-2           memoise_1.1.0          
[13] tibble_1.3.3            gtable_0.2.0            lattice_0.20-35        
[16] rlang_0.1.1             Matrix_1.2-10           DBI_0.6-1              
[19] GenomeInfoDbData_0.99.0 grid_3.4.0              R6_2.2.1               
[22] AnnotationDbi_1.39.0    XML_3.98-1.7            scales_0.4.1           
[25] assertthat_0.2.0        colorspace_1.3-2        RCurl_1.95-4.8         
[28] lazyeval_0.2.0          munsell_0.4.3 

I'll work further on this, noting that the original problem was much more severe than the one reported here.

If you are using parallel evaluation multiple times, the cost of establishing the cluster can be minimized by starting it first, e.g.,

> register(bpstart(MulticoreParam(workers=2)))
> system.time(BiocParallel::bplapply(1:1e2 , function(x) order(rnorm(n=1e3))))
   user  system elapsed 
  0.004   0.000   0.110 

I don't really think that the use case implied by the test (many very fast iterations) is the right context for R-level parallel evaluation; just do the operation without the complexity of parallelization, e.g., order(rnorm(n=1e3 * 1e2)). This is especially true for code in a package, where approximately half of our users will be on Windows and using independent processes, along the lines of

> library(BiocParallel)
> register(SnowParam(2))
> system.time(BiocParallel::bplapply(1:1e2 , function(x) order(rnorm(n=1e3))))
   user  system elapsed 
  0.080   0.000   0.865

Windows users must necessarily pay the cost of starting separate processes. Also, R-level code (casting no aspersions!) can often be written to run two or more orders of magnitude faster by using vectorization rather than iteration. In new package submissions, my response when I see the use of parallel packages of any sort is to ask whether the code itself should be refactored, which usually results in simpler, much faster, and more robust code. The usual steps are to 'hoist' constant sub-expressions out of loops, then hoist vectorizable sub-expressions out of the loop as pre-computed vectors.
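
To make the two 'hoisting' steps concrete, here is a minimal sketch in base R. The functions slow() and fast() and the inputs x and m are hypothetical, made up purely for illustration; they are not taken from any package discussed in this thread.

```r
set.seed(1)
x <- runif(1e4)                      # hypothetical per-element input
m <- matrix(rnorm(50), nrow = 10)    # hypothetical constant input

# Iterative version: the scale factor does not depend on i, yet it is
# recomputed on every pass, and log1p() is applied one element at a time.
slow <- function(x, m) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    scale <- sqrt(sum(m^2))          # constant sub-expression
    out[i] <- log1p(x[i]) / scale    # vectorizable sub-expression
  }
  out
}

# Refactored version: the constant is hoisted out of the loop, and the
# per-element work is replaced by a single vectorized call, so the loop
# disappears entirely.
fast <- function(x, m) {
  scale <- sqrt(sum(m^2))
  log1p(x) / scale
}

# Both versions should agree up to floating-point tolerance.
stopifnot(isTRUE(all.equal(slow(x, m), fast(x, m))))
```

Wrapping each call in system.time() should show the difference on your own machine; the size of the gap will depend on the length of x and on how expensive the hoisted expression is.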

When the granularity of the task is larger, the overhead of parallel evaluation becomes unimportant.
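
One way to coarsen the granularity by hand is to give each worker a single chunk of iterations instead of many tiny tasks. The following is only a sketch using the base parallel package, assuming 2 workers; the names f, chunks and n_workers are chosen for illustration. On Windows mclapply() cannot fork, so the sketch falls back to one core there.

```r
library(parallel)

f <- function(x) order(rnorm(n = 1e3))   # the per-iteration work from the example

n_tasks   <- 100L
n_workers <- 2L                           # assumption: two workers

# Split the iteration indices into one contiguous chunk per worker.
chunks <- splitIndices(n_tasks, n_workers)

# mclapply() relies on forking, which is not available on Windows.
cores <- if (.Platform$OS.type == "windows") 1L else n_workers

# Each worker receives one large task (a whole chunk) rather than many
# small ones, so the per-task communication cost is paid only once per worker.
res_chunked <- mclapply(chunks, function(idx) lapply(idx, f), mc.cores = cores)

# Flatten back to one result per original iteration.
res <- unlist(res_chunked, recursive = FALSE)
length(res)
```

This only reduces the per-task overhead; the cost of starting the workers is still there, so for work this small the serial lapply() will likely remain faster.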


Thanks Martin and Johannes! Both of your suggestions are appreciated!

I agree with Martin's point on vectorizing operations, but I came across this behavior and wanted to get your opinion on this.

This doesn't really answer your question, but since I also experienced problems with MulticoreParam on macOS...

On macOS I switched from MulticoreParam to DoparParam, i.e., I'm using the doParallel package for parallel processing. I had the feeling that multicore/MulticoreParam had a problem with forking, so I prefer registering the number of processes beforehand:


## First using Multicore:
system.time(bplapply(1:1e2 , function(x) order(rnorm(n=1e3))))
   user  system elapsed
  0.107   0.029   0.329

## Now with doPar:
system.time(bplapply(1:1e2 , function(x) order(rnorm(n=1e3))))
   user  system elapsed
  0.040   0.020   0.041
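
The registration calls are not shown in the timings above. A setup along the following lines would select the doParallel back-end before calling bplapply(); this is a sketch only, and the worker count of 2 is an assumption rather than the value used for the timings.

```r
library(BiocParallel)
library(doParallel)

n_workers <- 2L                         # assumption: not the value from the timings

# Register the foreach back-end first, then tell BiocParallel to use it.
registerDoParallel(cores = n_workers)
register(DoparParam())

# bplapply() now dispatches through the registered foreach back-end.
res <- bplapply(1:1e2, function(x) order(rnorm(n = 1e3)))
length(res)

# Release any implicitly created cluster when done.
stopImplicitCluster()
```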


My sessionInfo:

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin16.7.0/x86_64 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] BiocParallel_1.10.1 doParallel_1.0.10   iterators_1.0.8    
[4] foreach_1.4.3      

loaded via a namespace (and not attached):
[1] compiler_3.4.0   tools_3.4.0      codetools_0.2-15



