How does BiocParallel export variables to threads?
1
1
Entering edit mode
@gabrielhoffman-8391
Last seen 20 days ago
United States

I would like to run analysis in parallel using bplapply in the BiocParallel package. I define a function f that I want to run on each chunk of the data. When I run the analysis in serial, it works fine. But when I run in parallel, bplapply fails because it can't find the function since it isn't exported to the threads. If f were defined in some package I could just specify myPackage::f. I could just paste the definition of f directly into the function evaluated by bplapply, but this example is much simpler than my real issue.

In foreach the .export explicitly exports variables to the threads. But I can't figure this out for bplapply.

Here is a simple example of the problem:

library(BiocParallel)

# custom function
f = function(x){
    x * 10
}

# run in serial
# works fine
param = SerialParam()
res1 = bplapply( 1:10, function(i){
   f(i)
    }, BPPARAM=param)

# run in parallel
# fails because f is not exported to threads
param = SnowParam(2)
res1 = bplapply( 1:10, function(i){
   f(i)
    }, BPPARAM=param)
Error: BiocParallel errors
  element index: 1, 2, 3, 4, 5, 6, ...
  first error: could not find function "f"

Session Info

 sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin19.5.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BiocParallel_1.22.0

loaded via a namespace (and not attached):
[1] compiler_4.0.1 snow_0.4-3     parallel_4.0.1

but give same error on

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /hpc/packages/minerva-common/intel/parallel_studio_xe_2018/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] BiocParallel_1.16.6

loaded via a namespace (and not attached):
[1] compiler_3.6.0 snow_0.4-3     parallel_3.6.0
BiocParallel SnowParam parallel • 2.0k views
ADD COMMENT
5
Entering edit mode
@martin-morgan-1513
Last seen 5 months ago
United States

Here's a slightly simpler example

f = function(x)
   x * i

When R evaluates this, e.g., f(2), it looks for the variable i in the function body. It's not found, so R looks in the environment (the .GlobalEnv) in which the function was defined. If the variable is not found, R fails. Otherwise, it uses that value

> f(10)
Error in f(10) (from #1) : object 'i' not found
> i = 2; f(10)
[1] 20
> i = 3; f(10)
[1] 30

Here's a trickier example

g = function() {
    i = 2
    function(x) {
        x * i
    }
}

g() defines a variable i and a function, and returns the function, so

> f1 = g()
> f1
function(x) {
    x * i
}
<environment: 0x7fc2c962d3c8>
> rm(i); i
Error: object 'i' not found
> f1(10)
[1] 20
> i = 3; f1(10)
[1] 20

This shows that R looks for i in the environment in which the function is defined (f1 was defined in the body of g()) and not the environment from which it was called f1(10) is called from the .GlobalEnv).

This shows that the environment in which a function is defined is an intrinsic part of the function definition. Neat, eh?

R's 'parallel' package, and BiocParallel, sends the FUN argument of bplapply() to the worker. Since the environment in which the function is defined is an integral part of the function definition, it also sends the environment in which the function was defined to the worker. So

## Works!
bplapply(1:2, f1, BPPARAM = SnowParam(2))

g1 <- function() {
    i = 2
    f = function(x)
        i * x
    bplapply(1:2, f, BPPARAM = SnowParam(2))
}
g1()  # Works! i is in the environment in which f was defined

g2 <- function() {
    i = 2
    bplapply(1:2, function(x) i * x, BPPARAM = SnowParam(2))
}
g2()  # Works! i is in the environment in which the [anonymous] function is defined

So why do these fail?

> i = 2
> bplapply(1:2, function(x) i * x, BPPARAM = SnowParam(2))
Error: BiocParallel errors
  element index: 1, 2
  first error: object 'i' not found
> f = function(x) i * x
> bplapply(1:2, f, BPPARAM = SnowParam(2))
Error: BiocParallel errors
  element index: 1, 2
  first error: object 'i' not found

Shouldn't i be found in the environment in which f is defined? The parallel package, and BiocParallel, serialize environments, but not the .GlobalEnv. This is a base R implementation decision, to avoid the cost of sending the entire .GlobalEnv to the workers.

What about a function in a package that references variables or other functions in that package? The entire package environment, including objects the package imports from other packages (recursively), is sent to the workers, so those definitions are found.

And before trying a direct answer to your question, it's worth noting, perhaps, that for

p = bpstart(SnowParam(2))
h1 = function() {
    x = numeric(1e8)
    bplapply(1:2, function(x) sqrt(x), BPPARAM = p)
}
h2 = function() {
    x = numeric(1e8)
    bplapply(1:2, sqrt, BPPARAM = p)
}

h1() sends the object x to the workers (because it is part of the environment where the anonymous FUN was defined) but for no reason (because it is not used by the function). h2() does not send x to the workers, because it is not a part of the envionment in which sqrt was defined. Because of the extra data movement, h1() is much slower than h2().

After that long answer, it will be disappointing to learn that the answer is that you cannot explicitly export variables to workers. Instead, the best practice is to write functions that are self-contained with respect to the rules sketched above. For instance, I would re-define f so that it knows about all variables it uses

f <- function(x, i)
    x * i

and then invoke bplapply() providing the additional arguments

bplapply(1:2, f, i = 2, BPPARAM = SnowParam(2))
ADD COMMENT
1
Entering edit mode

Just want to say this is a really great explanation, thanks Martin!

ADD REPLY
0
Entering edit mode

Hi Martin, Thanks so much for the detailed explanation. I figured there was no easy solution. Your response was mostly focused on passing a variable to the thread. But it gets much worse if I want to pass a function.

So let me ask a different question since I figure I'm not the only one with this issue. But it will take a little set up first.

I wrote the package variancePartition which includes a function fitVarPartModel(...,fxn) that fits a linear mixed model using lmer, evaluates a summary of the model defined by fxn (defined by the user), and returns the results for each gene. Storing the full model fit for 20K genes or 450K methylation probes is not feasible, so fxn computes a smaller summary of, say, the coefficients and standard errors. This question arose because a user wanted to define fxn to be different than usual. I designed fitVarPartModel this way so that the user could customize their analysis with no involvement from the developer (i.e. me). The use case was simple enough: the user wanted to define fxn to call another function they wrote. Its certainly seems doable, but I was naive.

To run in serial, this is easy:

library(BiocParallel)

# serial
helper2 = function(x) x^2

helper = function(x){
    helper2(x) * 10
}

main = function(x){
    helper(x)
}

res1 = bplapply( 1:10, main, BPPARAM=SerialParam())

If I, as a developer, know there was a specific function users want to run, I could make it easy for the user. I could just include the functions in myPackage then I can refer to them with an "explicit scope" call using myPackage::helper:

# as a developer
helper2 = function(x) x^2

helper = function(x){
    myPackage::helper2(x) * 10
}

main = function(x){
    myPackage::helper(x)
}

res2 = bplapply( 1:10, main, BPPARAM=SnowParam(2))

But in order for a user to get this to work themselves (without writing a custom package with "explicit scope"), I (as a user) modified the function definitions so that it can run it in parallel based on the Martin's response above:

# as a user
helper2 = function(x) x^2

helper = function(x, .helper2){
    .helper2(x) * 10
}

main = function(x, .helper, .helper2){
    .helper(x, .helper2)
}

res2 = bplapply( 1:10, main, .helper=helper, .helper2 = helper2, BPPARAM=SnowParam(2))

But in practice these is not feasible because a user doesn't want to rewrite all their functions. Even if they did, they would have to manually track down any functions called by the helper functions, and then repeat recessively until only functions with "explicit scope" are used.

So my question is: Is there a way as user can define functions with "explicit scope"?

Cheers, Gabriel

ADD REPLY
0
Entering edit mode

I would instead arrange for the helpers to be defined in the same environment as the main function, but have that environment not be the global environment, e.g.,

g = local({
    f = function(x)
        x + 1
    function(x)
        f(x)
})
bplapply(1:2, g, BPPARAM = SnowParam(2))

or a parameterized version following a 'factory' pattern

g_factory = function(k) {
    f = function(x)
        x + k
    function(x)
        f(x)
}
bplapply(1:2, g_factory(2), BPPARAM = SnowParam(2))

If you'd like to open an issue at https://github.com/Bioconductor/BiocParallel asking for a BPEXPORT argument, and referencing this issue, it wouldn't be impossible to implement.

ADD REPLY

Login before adding your answer.

Traffic: 598 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6