Question: BiocParallel on Windows : cannot find my functions when using SnowParam
Asked 9 months ago by Wolfgang Raffelsberger:

Dear list, dear guRus,

First of all, many thanks for all the wonderful packages!

While writing BiocParallel code that should run in parallel on both Linux and Windows, I noticed the following surprising behaviour, which ends in an error.

Note that I am on Windows at this point. When I change BPPARAM from MulticoreParam() to SnowParam(), functions that I defined earlier in the session may no longer be found. This happens only when a new (anonymous) function is declared inside the bplapply() call; the call then fails with an error.

Eventually I want to set BPPARAM to either MulticoreParam() or SnowParam() depending on the platform detected, and leave the rest of the code unchanged.

 

So the workaround I see so far is to avoid declaring new functions inside bplapply().

However, I thought that sharing this (to me quite unexpected) behaviour might be useful to this list.

Any comments or hints? Am I doing something wrong in the way I call SnowParam()?

 

Best greetings,

Wolfgang Raffelsberger

## here an example to illustrate my observations on Windows
library("BiocParallel")
myFun1 <- function(x,val) val+sum(c(x,x^2,x^3))
testMu <- bplapply(1:3,myFun1,val=10,BPPARAM=MulticoreParam(workers=3))                           # OK
testSn <- bplapply(1:3,myFun1,val=10,BPPARAM=SnowParam(workers=3,type="SOCK"))                    # OK

## but
testMu <- bplapply(1:3,function(v) myFun1(v,val=10),BPPARAM=MulticoreParam(workers=3))            # OK
testSn <- bplapply(1:3,function(v) myFun1(v,val=10),BPPARAM=SnowParam(workers=3,type="SOCK"))     # error !

## output of traceback
> traceback(testSn <- bplapply(1:3,function(v) myFun1(v,val=10),BPPARAM=SnowParam(workers=3,type="SOCK")))
Erreur : BiocParallel errors
  element index: 1, 2, 3
  first error: impossible de trouver la fonction "myFun1"
## (French locale; the English message is: could not find function "myFun1")

## for completeness - output of sessionInfo
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocParallel_1.8.1

loaded via a namespace (and not attached):
[1] snow_0.4-2     tools_3.3.2    parallel_3.3.2
Answer by Martin Morgan (United States), 9 months ago:

Linux and macOS default to MulticoreParam(). Windows does not support MulticoreParam(), so it defaults to SnowParam().

MulticoreParam() uses a 'shared memory' model where the workers share the memory of the calling parent, and so automatically 'know' about functions that are defined in the manager R session.

SnowParam() starts separate processes that do not know about one another at all. It has rules for transferring objects from the manager environment to the worker environment. To understand the rules, one needs to know that every R symbol is defined in an environment, and that environments have 'parent' (possibly empty) environments. Working at the R prompt, one is in the .GlobalEnv environment. The rule is to NOT export symbols in the global environment to the workers. So

register(bpstart(SnowParam(2)))   # active snow cluster for the session
fun1 = function(x) x
result = bplapply(1:2, function(x) fun1(x))

fails -- fun1 is defined in the global environment, but not exported to the worker.

A simple solution is to make sure that the FUN argument to bplapply() references symbols that are either part of base R or are passed in as arguments, so

result = bplapply(1:2, function(x, doit_fun) doit_fun(x), doit_fun=fun1)

works.
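
Applied to the example from the question, the same idea might look like the sketch below. The argument name fun is a hypothetical choice made here, and the worker count is arbitrary; the point is only that myFun1 travels to the workers as an argument rather than being looked up by name.

```r
library(BiocParallel)

# Function from the question, defined at top level (global environment)
myFun1 <- function(x, val) val + sum(c(x, x^2, x^3))

# Pass myFun1 in as an argument ('fun' is a hypothetical name chosen here),
# so each worker receives it with the call instead of looking it up by name
testSn <- bplapply(
    1:3,
    function(v, fun) fun(v, val = 10),
    fun = myFun1,
    BPPARAM = SnowParam(workers = 2, type = "SOCK")
)
```

This should behave the same way with MulticoreParam() on Linux or macOS, since the anonymous function no longer depends on anything outside its own arguments and base R.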

A second solution is illustrated by

f = function() {
    fun1 = function(x) x
    bplapply(1:2, function(x) fun1(x))
}
result <- f()

This works because symbols defined in the environment where bplapply() is invoked are forwarded to the worker, provided that environment is not the global environment. The body of each function call, e.g., f(), has its own environment, whose parent is the environment in which the function was defined; the parent environment of f() is the global environment.

The rule about exporting symbols includes parent environments, so

f = function() {
    fun1 = function(x) x
    g = function() {
        bplapply(1:2, function(x) fun1(x))
    }
    g()
}
f()

also works -- bplapply() exports the environment of g(), and the parent environment of g() (i.e., the environment of f()), but not the parent environment of f() (the global environment).
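
One way to see this rule without starting a cluster is to round-trip a closure through serialize() and unserialize() in base R, which is roughly what a socket back-end has to do to ship a function to another process. This is only a sketch under that assumption: the inner anonymous function stands in for the FUN given to bplapply(), and the variable names (task, copy, g_frame, f_frame) are made up for the illustration.

```r
f <- function() {
    fun1 <- function(x) x
    g <- function() {
        function(x) fun1(x)   # stands in for the FUN given to bplapply()
    }
    g()
}
task <- f()

# Round trip: the closure is written out together with its enclosing frames
copy <- unserialize(serialize(task, NULL))

g_frame <- environment(copy)       # frame of the g() call
f_frame <- parent.env(g_frame)     # frame of the f() call

found_in_g <- exists("fun1", envir = g_frame, inherits = FALSE)
found_in_f <- exists("fun1", envir = f_frame, inherits = FALSE)
value <- copy(5)
```

Here found_in_g is FALSE and found_in_f is TRUE: fun1 is not in the frame of g() itself, but it is found one level up in the frame of f(), which was serialized along with the closure. The global environment, by contrast, is written only as a reference, so its contents do not travel.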

The reason for 'stopping' at the global environment also illustrates a potential hazard. The global environment frequently contains many, sometimes large, objects that are irrelevant to the calculation, so exporting all of them would be inefficient. Note, though, that the same cost applies to function environments: calls to

f <- function(n) {
    m <- integer(n)
    system.time(bplapply(1:2, function(x) x))
}

have evaluation times

> f(1e6)
   user  system elapsed 
  0.016   0.000   0.093 
> f(1e8)
   user  system elapsed 
  1.052   0.096   1.466 

with the additional cost from sending the (unused) integer vector m to the workers.
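
The size of what gets shipped can be estimated in base R by serializing the closure, again assuming that serialization is a fair proxy for what a socket back-end sends. The helper make_task() below is hypothetical and mirrors f() above, except that it returns the inner function instead of running it.

```r
# Hypothetical helper: returns a closure whose enclosing frame holds 'm'
make_task <- function(n) {
    m <- integer(n)      # never used by the returned function
    function(x) x
}

# Number of bytes needed to serialize each closure, enclosing frame included
small <- length(serialize(make_task(10), NULL))
big   <- length(serialize(make_task(1e6), NULL))

extra_bytes <- big - small
```

The difference is close to 4 MB, i.e. the one million 4-byte integers in m, even though the returned function never touches m. Removing such objects (for example with rm(m)) before creating the function, or defining the worker function outside the large frame, avoids the transfer.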

The behavior is inherited from the snow and parallel packages, and is not an arbitrary decision of BiocParallel.

The function bpvalidate() applied to the function used in bplapply() can help spot problematic code.
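
As a rough stand-in for that kind of check, the codetools package (shipped with R) can list the free symbols a function refers to. This is not necessarily how bpvalidate() works internally; it is a sketch that only shows which names would have to be resolved on the worker. The names bad and good are hypothetical.

```r
library(codetools)

# References 'fun1', which is not an argument and not defined in the body
bad <- function(x) fun1(x)

# Everything it needs is passed in as an argument
good <- function(x, doit_fun) doit_fun(x)

bad_globals  <- findGlobals(bad)
good_globals <- findGlobals(good)
```

Here bad_globals contains "fun1", a hint that the function relies on a symbol that must exist wherever it is evaluated, while good_globals does not.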

Cross-platform package developers should test their code using SnowParam(), to ensure that their package works on Windows or on a cluster whose nodes do not share memory.

The 'best practice' when implementing functions that use bplapply() is to do as above -- do NOT hard-code a value for the BPPARAM argument, so that the user can register() or provide their own back-end.
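
A minimal sketch of such a function follows. The names sumPowers and worker are hypothetical; the BPPARAM argument defaults to bpparam(), which returns the currently registered back-end, and the helper is defined inside the function so that it travels with the call.

```r
library(BiocParallel)

# Hypothetical package-style function: no back-end is hard-coded
sumPowers <- function(xs, val, BPPARAM = bpparam()) {
    # Defined inside the function, so it is forwarded to the workers
    worker <- function(x, val) val + sum(c(x, x^2, x^3))
    bplapply(xs, worker, val = val, BPPARAM = BPPARAM)
}

# The caller chooses the back-end; SerialParam() is used here for a quick check
res <- sumPowers(1:3, val = 10, BPPARAM = SerialParam())
```

A user on Windows could pass BPPARAM = SnowParam(2), or call register() once per session, without any change to sumPowers() itself.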

 


I have been trying to understand how to send data (objects, functions, whatever) to workers, and have read several posts about it, but I just can't seem to get it. The way it is explained always seems impenetrable to me... I have read about environments etc. I have a function that uses parallel processing inside it, so obviously I want to pass data, arguments etc. from the function call to the workers. I have ended up writing temporary files (with a defined file name) in the "main" part of the function, which the workers then load, but surely this cannot be the optimal way...

Reply written 6 weeks ago by Pekka Kohonen

Start your own question and include a SIMPLE example of what you are trying to do -- the description above isn't enough to understand how to help.

Reply written 6 weeks ago by Martin Morgan