I would like to run an analysis in parallel using bplapply from the BiocParallel package. I define a function f that I want to run on each chunk of the data. When I run the analysis in serial, it works fine. But when I run it in parallel, bplapply fails because it can't find f, since the function isn't exported to the workers. If f were defined in a package, I could just specify myPackage::f. I could also paste the definition of f directly into the function evaluated by bplapply, but this example is much simpler than my real issue.
In foreach, the .export argument explicitly exports variables to the workers. But I can't figure out how to do this for bplapply.
Here is a simple example of the problem:
library(BiocParallel)

# custom function
f = function(x){
    x * 10
}

# run in serial
# works fine
param = SerialParam()
res1 = bplapply( 1:10, function(i){
    f(i)
}, BPPARAM=param)

# run in parallel
# fails because f is not exported to the workers
param = SnowParam(2)
res1 = bplapply( 1:10, function(i){
    f(i)
}, BPPARAM=param)

Error: BiocParallel errors
  element index: 1, 2, 3, 4, 5, 6, ...
  first error: could not find function "f"
Session Info
sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin19.5.0 (64-bit)
Running under: macOS Catalina 10.15.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BiocParallel_1.22.0
loaded via a namespace (and not attached):
[1] compiler_4.0.1 snow_0.4-3 parallel_4.0.1
The same error occurs on:
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /hpc/packages/minerva-common/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] BiocParallel_1.16.6
loaded via a namespace (and not attached):
[1] compiler_3.6.0 snow_0.4-3 parallel_3.6.0
Just want to say this is a really great explanation, thanks Martin!
Hi Martin, Thanks so much for the detailed explanation. I figured there was no easy solution. Your response was mostly focused on passing a variable to the thread. But it gets much worse if I want to pass a function.
So let me ask a different question, since I figure I'm not the only one with this issue. But it will take a little setup first.
I wrote the package variancePartition, which includes a function fitVarPartModel(..., fxn) that fits a linear mixed model using lmer, evaluates a summary of the model defined by fxn (supplied by the user), and returns the results for each gene. Storing the full model fit for 20K genes or 450K methylation probes is not feasible, so fxn computes a smaller summary of, say, the coefficients and standard errors. This question arose because a user wanted to define fxn differently than usual. I designed fitVarPartModel this way so that the user could customize their analysis with no involvement from the developer (i.e. me). The use case was simple enough: the user wanted to define fxn to call another function they wrote. It certainly seems doable, but I was naive.
To run in serial, this is easy:
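A minimal sketch of the serial case, using bplapply with SerialParam as a stand-in for fitVarPartModel. The names helper and fxn and the toy computation are illustrative assumptions, not the user's actual code:

```r
library(BiocParallel)

# hypothetical user-written helper, defined in the global environment
helper <- function(x) {
    x * 10
}

# hypothetical user-supplied summary function that calls the helper
fxn <- function(i) {
    helper(i)
}

# in serial, fxn is evaluated in the current session, where helper is visible
res_serial <- bplapply(1:10, fxn, BPPARAM = SerialParam())
```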
If I, as a developer, knew there was a specific function users wanted to run, I could make it easy for the user. I could just include the functions in myPackage and then refer to them with an "explicit scope" call using myPackage::helper.
But in order for a user to get this to work themselves (without writing a custom package with "explicit scope"), I (as a user) modified the function definitions so that they can run in parallel, based on Martin's response above:
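One way such a modification might look, assuming the approach was to move the helper's definition inside fxn so that nothing is looked up in the global environment on the worker. The names and the toy computation are hypothetical:

```r
library(BiocParallel)

# hypothetical rewrite: helper is nested inside fxn, so it is created
# on the worker each time fxn is called
fxn <- function(i) {
    helper <- function(x) {
        x * 10
    }
    helper(i)
}

# SnowParam(2) starts two separate R worker processes
res_nested <- bplapply(1:10, fxn, BPPARAM = SnowParam(2))
```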
But in practice this is not feasible because a user doesn't want to rewrite all their functions. Even if they did, they would have to manually track down any functions called by the helper functions, and then repeat recursively until only functions with "explicit scope" are used.
So my question is: Is there a way a user can define functions with "explicit scope"?
Cheers, Gabriel
I would instead arrange for the helpers to be defined in the same environment as the main function, but have that environment not be the global environment, e.g.,
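A minimal sketch of this idea, assuming local() is used to create the shared, non-global environment. The names helper and fxn are hypothetical. R serializes a closure together with its enclosing environment unless that environment is the global environment, so helper should travel to the workers along with fxn:

```r
library(BiocParallel)

# helper and the main function share an environment created by local(),
# which is not the global environment
fxn <- local({
    helper <- function(x) {
        x * 10
    }
    function(i) {
        helper(i)
    }
})

# the enclosing environment (including helper) is sent with fxn
res_local <- bplapply(1:10, fxn, BPPARAM = SnowParam(2))
```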
or a parameterized version following a 'factory' pattern
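A sketch of what a factory might look like, under the same assumptions. The names make_fxn and multiplier are hypothetical. Each call to the factory creates a fresh environment holding helper and the parameter, and the returned function carries that environment with it:

```r
library(BiocParallel)

# hypothetical factory: returns a function whose enclosing environment
# holds both helper and the parameter
make_fxn <- function(multiplier) {
    force(multiplier)  # evaluate the argument now rather than lazily
    helper <- function(x) {
        x * multiplier
    }
    function(i) {
        helper(i)
    }
}

fxn10 <- make_fxn(10)
res_factory <- bplapply(1:10, fxn10, BPPARAM = SnowParam(2))
```

A different parameter value only requires another call to the factory, e.g. make_fxn(2), with no change to the code that runs on the workers.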
If you'd like to open an issue at https://github.com/Bioconductor/BiocParallel asking for a BPEXPORT argument, and referencing this issue, it wouldn't be impossible to implement.