Hi Lauren,
I'm not sure how you're measuring memory use on the master vs the workers. It would be helpful to see some code and the output of sessionInfo().
BiocParallel does not copy over the parent environment to the workers. Which arguments get passed to the workers depends on how you call bplapply(). You may already know this but here is an example just in case.
Register a param:
library(BiocParallel)
register(SnowParam()) ## could also use MulticoreParam()
'fun' lists the explicitly passed args and those passed through '...', as well as the contents of the environment.
fun <- function(i, ...) {
list(objects=objects(), dots=list(...), env=as.list(environment()))
}
'y' is defined in the workspace (i.e., on the master).
y <- "foo"
'y' not passed:
xx <- data.frame(matrix(1:10, ncol=2))
bplapply(xx, fun)
$X1
$X1$objects
[1] "i"
$X1$dots
list()
$X1$env
$X1$env$i
[1] 1 2 3 4 5
$X2
$X2$objects
[1] "i"
$X2$dots
list()
$X2$env
$X2$env$i
[1] 6 7 8 9 10
'y' explicitly passed:
bplapply(xx, fun, y=y)
$X1
$X1$objects
[1] "i"
$X1$dots
$X1$dots$y
[1] "foo"
$X1$env
$X1$env$i
[1] 1 2 3 4 5
$X2
$X2$objects
[1] "i"
$X2$dots
$X2$dots$y
[1] "foo"
$X2$env
$X2$env$i
[1] 6 7 8 9 10
When bplapply() is wrapped in another function, 'y' reaches the workers through '...'. You can test 'myfun' to confirm this:
myfun <- function(xx, fun, ...) {
bplapply(xx, fun, ...)
}
myfun(xx, fun, y)
The take home is that sending objects to the workers is done by passing args to bplapply (and friends) either (1) explicitly or (2) through '...' when wrapped in another function.
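As a compact check of the two routes above, here is a minimal sketch. It assumes BiocParallel is installed and that two local workers can be started; 'tag', 'suffix' and 'wrapper' are hypothetical names used only for this illustration.

```r
library(BiocParallel)

# Two workers in separate R processes.
param <- SnowParam(workers = 2)

# 'tag' and its 'suffix' argument are made up for this example.
tag <- function(i, suffix) paste0(i, "_", suffix)

y <- "foo"

# (1) Explicit: 'suffix' is passed to bplapply() and sent to the workers.
explicit <- bplapply(1:3, tag, suffix = y, BPPARAM = param)

# (2) Through '...': the wrapper forwards its extra args to bplapply().
wrapper <- function(X, FUN, ..., BPPARAM) {
    bplapply(X, FUN, ..., BPPARAM = BPPARAM)
}
forwarded <- wrapper(1:3, tag, suffix = y, BPPARAM = param)

# Both routes should deliver the same value of 'suffix' to the workers.
identical(explicit, forwarded)
```

Passing 'suffix' by name keeps the call unambiguous when the wrapper has other arguments of its own.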
Regarding memory use, setting log = TRUE tracks memory use on the workers. Until now, the worker script in BiocParallel called gc(reset=TRUE) before the computation but only reported the output of gc() called after the computation. As of 1.3.38 (devel) and 1.2.12 (release) I've changed this to report the difference between the two calls to gc(). This represents the change in max memory used and is more helpful, I think, in assessing the memory used by the process.
Register a param with logging enabled:
register(SnowParam(log = TRUE))
'fun' performs a reasonable amount of computation on the workers but very little data are passed to and from the master. The idea is to create an obvious difference in memory use that can be seen through gc().
fun <- function(i) {
alist <- as.list(seq_len(1e5))
amat <- matrix(1:1e4, ncol=20)
aprod <- amat %*% t(amat)
colSums(aprod)
}
This chunk of code mimics what is done on the workers to measure memory with gc(). Comparing the max memory used on the master (3.4 Mb for Ncells and 0.5 Mb for Vcells) with that on the workers (5.5 Mb for Ncells and 4.1 Mb for Vcells), we see they aren't the same.
Code to be executed:
gc0 = gc(reset=TRUE)
res <- bplapply(1:2, fun)
(gc() - gc0)[,5:6]
Executed:
> gc0 = gc(reset=TRUE)
> res <- bplapply(1:2, fun)
INFO [2015-07-21 22:02:38] loading futile.logger on workers
############### LOG OUTPUT ###############
Task: 1
Node: 1
Timestamp: 2015-07-21 22:02:38
Success: TRUE
Task duration:
user system elapsed
0.007 0.001 0.007
Max Memory Used:
max used (Mb)
Ncells 102147 5.5
Vcells 545747 4.1
Log messages:
stderr and stdout:
character(0)
############### LOG OUTPUT ###############
Task: 2
Node: 2
Timestamp: 2015-07-21 22:02:38
Success: TRUE
Task duration:
user system elapsed
0.004 0.003 0.008
Max Memory Used:
max used (Mb)
Ncells 102147 5.5
Vcells 545747 4.1
Log messages:
stderr and stdout:
character(0)
> (gc() - gc0)[,5:6]
max used (Mb)
Ncells 64538 3.4
Vcells 74546 0.5
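The gc() bookkeeping used above can be wrapped in a small base-R helper if you want to repeat the measurement around other code. This is only a sketch: 'mem_delta' is a hypothetical name, not a BiocParallel function, and it assumes the default gc() layout in which columns 5 and 6 hold "max used" and its size in Mb.

```r
# Hypothetical helper: change in "max used" while 'expr' is evaluated.
mem_delta <- function(expr) {
    gc0 <- gc(reset = TRUE)        # reset the max counters, keep a baseline
    value <- force(expr)           # evaluate the lazily passed expression now
    delta <- (gc() - gc0)[, 5:6]   # columns 5:6 are "max used" and "(Mb)"
    list(value = value, delta = delta)
}

out <- mem_delta({
    amat <- matrix(1:1e4, ncol = 20)
    colSums(amat %*% t(amat))
})
out$delta
```

The numbers describe only the R process in which the helper runs, so calling it on the master says nothing about what the workers allocate.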
The new versions of BiocParallel should be available via biocLite() by noon PST on Thursday, July 23, or immediately via svn. You should be able to pass a BPPARAM with logging enabled, e.g., SnowParam(log=TRUE), to the DESeq2 function. Once you do this, you should have more information about memory use on the workers. Let me know if this does not make sense or if you have trouble interpreting the results.
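For reference, a minimal sketch of what that call might look like. It assumes DESeq2 and BiocParallel are installed and uses a small simulated data set from makeExampleDESeqDataSet() in place of your own; the worker count and data dimensions are arbitrary choices for illustration.

```r
library(DESeq2)
library(BiocParallel)

# Small simulated data set standing in for the real experiment.
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

# Logging enabled so each task reports its memory use.
param <- SnowParam(workers = 2, log = TRUE)

# parallel = TRUE tells DESeq() to use the supplied BPPARAM.
dds <- DESeq(dds, parallel = TRUE, BPPARAM = param)
```

The log output printed for each task should include the "Max Memory Used" table shown earlier.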
Valerie
What operating system are you working under? Also, how big is your experiment? That seems like more memory than DESeq2 should typically consume.
Hi, I am working on a Linux-based OS (Ubuntu), and the experiment is 600 samples by 100,000 species (it is a metagenomic/amplicon study).