Question: defaults in BiocParallel
Kasper Daniel Hansen (United States) wrote, 4.9 years ago:

I am starting to look into BiocParallel, probably later than I should.

I am on a 64 core node.  As far as I know, I have done nothing except load GenomicFiles.  If I do

> registered()
$MulticoreParam
class: MulticoreParam; bpisup: TRUE; bpworkers: 64; catch.errors: TRUE
setSeed: TRUE; recursive: TRUE; cleanup: TRUE; cleanupSignal: 15; verbose: FALSE

$SnowParam
class: SnowParam; bpisup: FALSE; bpworkers: 64; catch.errors: TRUE
cluster spec: 64; type: PSOCK

$BatchJobsParam
class: BatchJobsParam; bpisup: TRUE; bpworkers: NA; catch.errors: TRUE
cleanup: TRUE; stop.on.error: FALSE; progressbar: TRUE

$SerialParam
class: SerialParam; bpisup: TRUE; bpworkers: 1; catch.errors: TRUE

If I understand it correctly, I now have 4 registered parallel backends (without doing anything) and the default is multicore. I think it is highly problematic for multi-user systems that the default is selected in this way. Specifically, in this case I have not requested 64 cores from my scheduler.  Instead, I believe the default parallel backend should always be serial, and that we need to have user intervention to do more.

In line with this - and wearing my admin cap for this paragraph - I think it would be pretty convenient if it were possible to modify the default choices and settings using environment variables.  This way, suitable choices can be made for users in a multi-user environment, based on scheduling requests.  For example, I would like to write something in Rprofile.site which sets the default number of cores in a MulticoreParam, not based on cores-in-machine, but on cores-in-scheduling-request.
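
A minimal sketch of the kind of site-wide setting asked for here, assuming it is placed in the site profile. The scheduler variables it consults (SLURM_CPUS_PER_TASK, NSLOTS, PBS_NUM_PPN) are assumptions about what a given scheduler exports, and the helper name scheduler_cores() is hypothetical; both would need adapting to the local setup. When no allocation is found it falls back to a single core.

```r
# Hypothetical helper: number of cores granted by the scheduler, or
# `fallback` when no allocation can be found in the environment.
scheduler_cores <- function(
    vars = c("SLURM_CPUS_PER_TASK", "NSLOTS", "PBS_NUM_PPN"),  # assumed names
    fallback = 1L) {
  for (v in vars) {
    # Unset or non-numeric values become NA and are skipped.
    n <- suppressWarnings(as.integer(Sys.getenv(v, unset = NA)))
    if (!is.na(n) && n > 0L) return(n)
  }
  fallback
}

# Respect an explicit earlier choice; otherwise follow the scheduler request.
if (is.null(getOption("mc.cores"))) {
  options(mc.cores = scheduler_cores())
}
```

Because the site profile is sourced into the base environment, wrapping the whole fragment in local({ ... }) would keep the helper from lingering there. A user can still override the value later with options(mc.cores = ...).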

Also, I don't understand why the registered SnowParam is different from what I see with

> SnowParam()
class: SnowParam; bpisup: FALSE; bpworkers: 0; catch.errors: TRUE
cluster spec: 0; type: PSOCK

Martin Morgan (United States) wrote, 4.9 years ago:

In the release version of the package (version 0.6.1), set the mc.cores option with options() (including in .Rprofile) to influence the default multicore configuration; see ?MulticoreParam. More generally, register an appropriately configured default back-end for your system by loading (and perhaps not attaching?) BiocParallel and registering your favored configuration in the .Rprofile, e.g. BiocParallel::register(BiocParallel::MulticoreParam(3)).

In devel (versions >= 0.99.23), defaults can be set with, for instance, options(MulticoreParam=quote(MulticoreParam(3))) in .Rprofile. This might be an effective way to configure a shared computer with a batch jobs back end. Also in devel, MulticoreParam() now defaults to a maximum of 8 cores, and the call SnowParam() and the default registration are the same (re-using the mc.cores option if set, or the minimum of 8 and the return of detectCores() if not).
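
The devel default just described can be paraphrased in base R roughly as follows. This is only a sketch of the stated rule (mc.cores if set, otherwise the smaller of 8 and detectCores()), not BiocParallel's actual source; default_workers() is a hypothetical name, and the fallback to one worker when detectCores() returns NA is an assumption.

```r
# Hypothetical paraphrase of the rule stated above; not BiocParallel's code.
default_workers <- function(max_default = 8L) {
  cores <- getOption("mc.cores")
  if (!is.null(cores)) return(as.integer(cores))  # an explicit option wins
  detected <- parallel::detectCores()             # may be NA on some platforms
  if (is.na(detected)) detected <- 1L             # assumption: go serial
  min(max_default, detected)
}

default_workers()
```

On the 64 core node from the question, and with mc.cores unset, this rule would give 8 workers rather than 64.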

I disagree with the choice of 'always serial', and hope the more modest use of default number of cores is more palatable. I chose this strategy thinking that the user with a complicated system (e.g., cluster) would not use registered params anyway.

There are plans to elaborate on a verbose flag, e.g., using a formal logging mechanism. The birth of this is evident in the devel version. It's easy enough to be verbose on the head node, but harder to be verbose from the workers.

Thanks for the links to options.  Regarding SnowParam(), it seems to be the only XXParam() where XXParam() does not give me the same as is already registered.  Note that I did not do anything to register anything; these were all defaults that appeared.

The issue with the default choice of parallel routine is the following. Now (and in the future) we want to move increasingly to using bplapply and friends. Hopefully the long-term impact of BiocParallel will be for developers to use bplapply anytime they (now) use lapply and it involves more than a basically instantaneous computation.  This means that a larger set of operations in Bioconductor packages will be automatically parallelized, even if the user is unaware. While I love multicore and friends, I note that I have had several instances, both on a private machine and on a cluster node, where aggressive use of multicore has crashed the machine.  Here I am particularly concerned about unsophisticated new users.
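
To make the concern concrete, here is a small sketch of BiocParallel's bplapply() used as a drop-in for lapply(). It assumes BiocParallel is installed; slow_sqrt() is a made-up stand-in workload. Passing BPPARAM explicitly, as done here, is one way for a caller to keep the degree of parallelism visible; when BPPARAM is left out, the registered default back-end is used.

```r
# Requires the BiocParallel package.
library(BiocParallel)

# Made-up stand-in for a computation that is not instantaneous.
slow_sqrt <- function(x) {
  Sys.sleep(0.01)
  sqrt(x)
}

serial   <- lapply(1:4, slow_sqrt)

# Same call shape as lapply(); BPPARAM chooses the back-end explicitly.
explicit <- bplapply(1:4, slow_sqrt, BPPARAM = SerialParam())

# The two result lists can be compared directly.
all.equal(serial, explicit)
```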

In line with this, it looks to me that we do not have a way of enforcing feedback to the user regarding parallelization.  I think it might be nice to have something like verbose=TRUE/FALSE in some system setting, which would entail user feedback whenever these parallel routines are used.  I could not see this when looking briefly (we could also have verbose levels, so setting verbose to an integer >1 means even more details - this has been very useful to me in my work).  But perhaps I should get some more experience with the package first.

I'll update my answer with the following -- SnowParam() and the registered default are the same; there's now a better (?) mechanism to read defaults from Rprofile; the default registrations use at most 8 cores.

Henrik Bengtsson (United States) wrote, 22 months ago:

Stumbled upon this old thread.  I fully agree with Kasper here (if this is still his position).  Scientific software, including R packages, that defaults to "hijacking" whatever cores are available on a machine is likely to cause problems in multi-tenant environments such as compute clusters, but also on single-user machines where layers of packages run their own parallel code.  Ironically, the problem gets worse the bigger the machine is, i.e. an 8-core machine may be overloaded up to 8 times (800% CPU load) but a 48-core machine may be overloaded 48 times with these default designs.
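
The arithmetic behind those numbers can be sketched as follows, under one reading of the claim: every one of the machine's cores ends up hosting a worker (or a user's job) that itself defaults to using all cores. The function name oversubscription() is hypothetical.

```r
# Assumed scenario: `cores` outer workers, each starting `cores` inner workers.
oversubscription <- function(cores) {
  processes <- cores * cores           # total busy processes
  c(processes   = processes,
    load_factor = processes / cores)   # load relative to machine capacity
}

oversubscription(8)
oversubscription(48)
```

The load factor equals the core count, which is why the same default hurts more on a larger machine.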

The worst is where there is no way to control the number of cores a piece of software uses other than by specifying a command-line option, and where all cores are used if that option is not given.  That makes it really hard for such software to play well in multi-tenant environments - you basically have to monitor and teach every new user that they need to be aware of this behavior.  This also goes for new software.  The second best is when there is an option to control the default.  Martin suggests that one set:

options(mc.cores = ncores)

in ~/.Rprofile.  I'd like to add that in a multi-tenant environment the sysadm can also set this for all users in the site-wide Rprofile file, file.path(R.home("etc"), "Rprofile.site").  Also, if option mc.cores is not set, then it is set according to environment variable MC_CORES when the parallel package is loaded (which it is when BiocParallel is loaded).  Because of this, the sysadm can alternatively use

export MC_CORES=1

in the site-wide shell startup script.  I prefer this since it's more likely to survive R updates.  With this, the user has to explicitly override its value in the job script, e.g. according to a job scheduler's environment variable.

Because of the above, I argue that parallelism in scientific software should be explicitly requested by the user (or implicitly via job submission scripts / env vars).  This also helps protect against recursive parallelism, which we will see more of in R as parallel processing gets easier to use and as the package dependency graph grows (and where neither the user nor the developer is in control of the whole software stack).