Question: Memory usage for bplapply -- is the entire environment copied over to each worker thread?
4
4.1 years ago by
Lauren40
Australia
Lauren40 wrote:

Hello,

I am conducting an analysis of RNA-seq data with DESeq2, and I noticed that when I run DESeq2 in parallel (using the parallel=TRUE option), each worker thread uses the same amount of RAM as the parent session/thread. In my case, the parent session/thread has a number of objects in memory, totalling approximately 10Gb. The objects required for the DESeq function account for only about 1/5th of this, yet all the worker threads appear to be using 10Gb straight away. Does this mean that the bplapply function copies the parent environment over?

> BPPARAM
class: MulticoreParam
bplog:FALSE; bpthreshold:INFO; bplogdir:NA
bpstopOnError:FALSE; bpprogressbar:FALSE
bpresultdir:NA
cluster type: FORK

If it does copy the environment over, is it possible to create a specific environment to hand over (to avoid all the unnecessary duplication), or do I need to clear the workspace of all irrelevant objects every time I want to run in parallel?

Much appreciated!

deseq2 biocparallel bplapply • 1.7k views
modified 4.1 years ago by Valerie Obenchain6.7k • written 4.1 years ago by Lauren40

What operating system are you working under? Also, how big is your experiment? That seems like more memory than DESeq2 should typically consume.

Hi, I am operating on a Linux based OS (Ubuntu) and the experiment is 600 samples by 100,000 species (it is a metagenomic/amplicon study).

Answer: Memory usage for bplapply -- is the entire environment copied over to each worker thread?
3
4.1 years ago by
United States
Valerie Obenchain6.7k wrote:

Hi Lauren,

I'm not sure how you're measuring memory use on the master vs the workers. It would be helpful to see some code and the output of sessionInfo().

BiocParallel does not copy over the parent environment to the workers. Which arguments get passed to the workers depends on how you call bplapply(). You may already know this but here is an example just in case.

Register a param:

register(SnowParam()) ## could also use MulticoreParam()

'fun' lists the explicitly passed args and those passed through '...', as well as the contents of the environment.

fun <- function(i, ...) {
    list(objects=objects(), dots=list(...), env=as.list(environment()))
}

'y' is defined in the workspace (ie, on the master).
y <- "foo"

'y' not passed:
xx <- data.frame(matrix(1:10, ncol=2))
bplapply(xx, fun)
$X1
$X1$objects
[1] "i"

$X1$dots
list()

$X1$env
$X1$env$i
[1] 1 2 3 4 5


$X2
$X2$objects
[1] "i"

$X2$dots
list()

$X2$env
$X2$env$i
[1]  6  7  8  9 10

'y' explicitly passed:
bplapply(xx, fun, y=y)
$X1
$X1$objects
[1] "i"

$X1$dots
$X1$dots$y
[1] "foo"

$X1$env
$X1$env$i
[1] 1 2 3 4 5


$X2
$X2$objects
[1] "i"

$X2$dots
$X2$dots$y
[1] "foo"

$X2$env
$X2$env$i
[1]  6  7  8  9 10

You can test 'myfun' and see that 'y' is passed implicitly through '...'.

myfun <- function(xx, fun, ...) {
    bplapply(xx, fun, ...)
}
myfun(xx, fun, y)

The take home is that sending objects to the workers is done by passing args to bplapply (and friends) either (1) explicitly or (2) through '...' when wrapped in another function.
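Since bplapply() follows ordinary R argument matching, the same two routes can be sketched with base lapply() as a serial stand-in (the function and argument names here are just illustrative):

```r
## Two ways extra arguments reach FUN: explicitly, or via '...' in a wrapper.
f <- function(i, offset) i + offset

## (1) explicitly passed:
res1 <- lapply(1:3, f, offset = 10)

## (2) passed through '...' by a wrapper function:
wrapper <- function(x, FUN, ...) lapply(x, FUN, ...)
res2 <- wrapper(1:3, f, offset = 10)

identical(res1, res2)  # TRUE
unlist(res1)           # 11 12 13
```

The parallel versions behave the same way: whatever reaches bplapply() as a named argument or through '...' is what gets serialized to the workers.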

Regarding memory use, setting log = TRUE tracks memory use on the workers. The worker script in BiocParallel calls gc(reset=TRUE) before the computation but previously reported only the output of gc() called after the computation. As of 1.3.38 (devel) and 1.2.12 (release) I've changed this to report the difference between the two calls to gc(). This represents the change in maximum memory used and is, I think, more helpful in assessing the memory used by the process.

Register a param with logging enabled:

register(SnowParam(log = TRUE))

'fun' performs a reasonable amount of computation on the workers but very little data are passed to and from the master. The idea is to create an obvious difference in memory use that can be seen through gc().

fun <- function(i) {
    alist <- as.list(seq_len(1e5))
    amat <- matrix(1:1e4, ncol=20)
    aprod <- amat %*% t(amat)
    colSums(aprod)
}

This chunk of code mimics what is done on the workers with respect to measuring memory with gc(). Comparing the memory use on the master (3.4 Ncells and 0.5 Vcells) vs the workers (5.5 Ncells and 4.1 Vcells) we see they aren't the same.

Code to be executed:

gc0 <- gc(reset=TRUE)
res <- bplapply(1:2, fun)
(gc() - gc0)[,5:6]

Executed:

> gc0 = gc(reset=TRUE)
> res <- bplapply(1:2, fun)
INFO [2015-07-21 22:02:38] loading futile.logger on workers
############### LOG OUTPUT ###############
Task: 1
Node: 1
Timestamp: 2015-07-21 22:02:38
Success: TRUE
Task duration:
   user  system elapsed
  0.007   0.001   0.007
Max Memory Used:
       max used (Mb)
Ncells   102147  5.5
Vcells   545747  4.1
Log messages:
stderr and stdout:
character(0)
############### LOG OUTPUT ###############
Task: 2
Node: 2
Timestamp: 2015-07-21 22:02:38
Success: TRUE
Task duration:
   user  system elapsed
  0.004   0.003   0.008
Max Memory Used:
       max used (Mb)
Ncells   102147  5.5
Vcells   545747  4.1
Log messages:
stderr and stdout:
character(0)
> (gc() - gc0)[,5:6]
       max used (Mb)
Ncells    64538  3.4
Vcells    74546  0.5

The new versions of BiocParallel should be available via biocLite() by noon PST on Thursday, July 23, or immediately via svn. You should be able to pass a BPPARAM with logging enabled to the DESeq2 function, e.g., SnowParam(log=TRUE). Once you do this, you should have more information about memory use on the workers. Let me know if this doesn't make sense or you have trouble interpreting the results.

Valerie


BiocParallel supports four different back-ends, each with different characteristics. My example above used SnowParam() for non-shared-memory computing, and the point was that bplapply() itself does not copy the master environment and send it to the workers. Each param creates workers by a different mechanism: MulticoreParam() shared-memory workers (forks) do inherit from the master, while BatchJobsParam() and DoparParam() use different approaches again.
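The fork-vs-snow difference can be seen directly with the base parallel package, which provides the same two worker mechanisms (this is a minimal sketch, assuming a Unix-alike so that mclapply actually forks; on Windows it falls back to serial evaluation, which still inherits the workspace):

```r
library(parallel)

y <- "defined on the master"

## Forked workers (the mechanism behind MulticoreParam) share the
## master's memory, so 'y' is visible without being passed:
fork_sees_y <- mclapply(1, function(i) exists("y"))[[1]]

## PSOCK/snow workers (the mechanism behind SnowParam) are fresh R
## sessions, so 'y' is NOT visible unless passed explicitly:
cl <- makePSOCKcluster(1)
snow_sees_y <- parLapply(cl, 1, function(i) exists("y"))[[1]]
stopCluster(cl)

fork_sees_y  # TRUE
snow_sees_y  # FALSE
```

This is why memory-use questions have to be asked per back-end: forked workers inherit everything but copy lazily, while snow workers receive only what is serialized to them.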

It's probably best to start with your code example (how you are calling the function and measuring memory use) and the output of sessionInfo() and go from there.

Valerie

I cleaned my environment up before running again (the parent session now uses 2Gb). I was looking at the RAM usage in top (each child process was reported as using 2Gb), but I just noticed that if I look at RAM usage via the System Monitor, it is only 650Mb per child thread. I will install the new version of BiocParallel when it becomes available to see how the RAM is being used.

I was following the DESeq2 manual on how to run in parallel:

library("BiocParallel")
register(MulticoreParam(4))

Hi Mike (and Lauren),

I wanted to follow up on the conversation Mike and I had at BioC2015 about how to measure memory use on the workers. There are several approaches and I've gone back and forth on which would be best / most informative.

One approach is pryr::mem_change() which looks at the difference between calls to gc() before and after code evaluation. (It's more elegant than that, I'm just summarizing.) This approach has the advantage of isolating just the memory used for the expression evaluated.

Another approach is to call gc(reset=TRUE) before evaluation and report the gc() called after. This approach has some residual built into the output which essentially equals a call to gc() in a fresh R session. This is how the BatchJobs package reports memory use in '.out' files.
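The second approach can be sketched in a few lines of base R (measure_gc is a hypothetical helper name; the idea mirrors what the worker script does):

```r
## Sketch of gc-based accounting: reset the "max used" counters, evaluate
## the code being measured, then report the change in maximum memory used.
measure_gc <- function(expr) {
    gc0 <- gc(reset = TRUE)   # reset the "max used" counters
    force(expr)               # evaluate the code being measured
    gc1 <- gc()               # max used since the reset
    (gc1 - gc0)[, 5:6]        # change in max Ncells / Vcells
}

delta <- measure_gc(x <- numeric(1e6))   # allocates ~8 Mb of doubles
delta
```

The reported Vcells delta for this example should be roughly 7.6 Mb, plus whatever residual the evaluation itself accrues, which is the "built-in residual" mentioned above.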

I think either approach is fine as long as one understands what the output is and how to interpret it. In the interest of being consistent across back-ends, I chose the second option.

Memory use reported in a '.out' file from BatchJobsParam may look slightly higher than that from MulticoreParam or SnowParam. This is because the script in BatchJobs does a little more work than the one in BiocParallel, but overall the numbers are quite comparable.

These changes are in 1.3.41 and 1.2.14. If anyone else on the list has an opinion on how to best track memory feel free to chime in.

Valerie


Thanks to Martin's help we can shed some light on the memory consumption you're seeing with MulticoreParam (ie, workers created with fork()).

He directed me to this post from Radford discussing memory use in mclapply. To summarize, forked workers share memory with the parent process. If they don't 'touch' (write to) large objects on the parent, no harm done. The garbage collector, however, writes to each object to mark/unmark it when a full collection is done and this may trigger a copy of the object.

It's not possible to turn the garbage collector off. One possible way of influencing how frequently collection occurs may be the command-line options --min-vsize and --min-nsize. With large values for these parameters garbage collection might not occur (we think).

This example demonstrates memory growth in a function that is doing nothing but gc(). A large object is created in the master environment but not passed to the worker. The workers have access to the master's memory and when gc() occurs the large object is 'touched' and copied. Running top in a terminal window you'll see a jump in 'mem used' when gc() is hit and then a return to initial levels.

## sleep, gc, sleep some more
fun <- function(i) {
    Sys.sleep(10)
    gc(); gc(); gc()
    Sys.sleep(10)
}

## large object on the master (~800 Mb of doubles)
mat <- numeric(1e8)

## large object not passed to the workers
mclapply(1:8, fun, mc.cores=8)

Valerie

thanks for this information Valerie (and Martin). surprising and good to know: "The garbage collector, however, writes to each object to mark/unmark it when a full collection is done and this may trigger a copy of the object."

I guess for now, I'll make a note about this in the vignette and man pages, to say if possible users should remove any large unneeded objects from the environment, as these might be copied by the workers.

hi Valerie,

I get the same output for your example, but I'm a bit confused why I can't find 'y' in my example:

library(BiocParallel)
register(SnowParam(2))
fun <- function(i, ...) {
list(objects=objects(), dots=list(...), env=as.list(environment()))
}
y <- 7
xx <- data.frame(matrix(1:10, ncol=2))

> bplapply(xx, fun, y=y)
starting worker for localhost:11802
starting worker for localhost:11802
$X1
$X1$objects
[1] "i"

$X1$dots
$X1$dots$y
[1] 7

$X1$env
$X1$env$i
[1] 1 2 3 4 5


$X2
$X2$objects
[1] "i"

$X2$dots
$X2$dots$y
[1] 7

$X2$env
$X2$env$i
[1]  6  7  8  9 10

# what if i try to use 'y'
fun <- function(i, ...) {
i + y
}

!> bplapply(xx, fun, y=y)
starting worker for localhost:11802
starting worker for localhost:11802
$X1
<remote-error in FUN(...): object 'y' not found>
traceback() available as 'attr(x, "traceback")'

$X2
<remote-error in FUN(...): object 'y' not found>
traceback() available as 'attr(x, "traceback")'

!> sessionInfo()
R Under development (unstable) (2015-07-02 r68623)
Platform: x86_64-apple-darwin14.3.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices datasets  utils     methods
[8] base

other attached packages:
[1] BiocParallel_1.3.48

loaded via a namespace (and not attached):
[1] snow_0.3-13          futile.logger_1.4.1  lambda.r_1.1.7
[4] futile.options_1.0.0 git2r_0.10.1

Hi,

This is not specific to parallel work but to function arguments in general. Below, 'y' is a defined argument in fun1 and falls through '...' in fun2. The '...' is useful for passing args down through multiple (nested) functions until you need them, and avoids having to define them explicitly as arguments at each step. When you want to use them in the function body, however, 'y' must be defined as an argument or retrieved from '...' with something like list(...)$y (not recommended).

fun1 <- function(x, y, ...) {
x + y
}

fun2 <- function(x, ...) {
x + y
}

> lapply(1:3, fun1, 10)
[[1]]
[1] 11

[[2]]
[1] 12

[[3]]
[1] 13

> lapply(1:3, fun2, 10)
Error in FUN(X[[i]], ...) : object 'y' not found

> lapply(1:3, fun2, y=10)
Error in FUN(X[[i]], ...) : object 'y' not found

I think that's what you were after - let me know if that doesn't answer your question.

Valerie

modified 4.0 years ago by Michael Love24k • written 4.0 years ago by Valerie Obenchain6.7k

right, of course. thanks for the quick reply. there's always more elementary R to learn :)