Question: how to achieve reproducibility with BiocParallel regardless of number of threads and OS (set.seed is disallowed)
0
gravatar for luluchen
3 months ago by
luluchen0
luluchen0 wrote:

Dear All,

My package will use BiocParallel for permutation test.  To make permutations reproducible, I read some discussions online (https://stat.ethz.ch/pipermail/bioc-devel/2015-June/007583.htmlhttps://stackoverflow.com/questions/30610375/how-to-run-permutations-using-mclapply-in-a-reproducible-way-regardless-of-numbe). There are two easy solutions:

(1) Generate the permuted sample indexes before parallel computations. In this way, set.seed() can be used outside of my R function.

(2) Generate a vector of seeds (provided by users through argument of my R function) and use them by set.seed() in each parallel computation.

My permutation scheme is a little complex, so storing sample indexes needs a huge storage space. I have to use the second solution. However, set.seed() will cause warning in BiocCheck(). Could anybody give me suggestions?

Thanks a lot for your help!

Best,

Lulu

biocparallel • 214 views
ADD COMMENTlink modified 3 months ago by Aaron Lun23k • written 3 months ago by luluchen0
Answer: how to achieve reproducibility with BiocParallel regardless of number of threads
0
gravatar for Aaron Lun
3 months ago by
Aaron Lun23k
Cambridge, United Kingdom
Aaron Lun23k wrote:

This question is more appropriate for the Bioc-devel mailing list. Please direct future correspondence there.

In the meantime, here's some food for thought. I often run into this problem in several of my packages containing with heavy-duty randomized algorithms, e.g., correlation functions in scran, emptyDrops in DropletUtils. Fortunately for me, all of the functions that do enough computationally intensive work to warrant parallelization have also been re-written in C++. Thus, in the serial phase of each function, I use R's PRNG to generate random integers, one per computational job (e.g., per sample or per iteration). In each job, the sampled integer is used to seed std::mt19937 in each worker for random number generation at the C++ level. No matter how the jobs are split up between workers, I will always obtain the same series of random numbers. This ensures that I obtain the same results regardless of the BiocParallelParam object used. It also leaves R's .Rand.seed in the same state regardless of the parallelization scheme, which ensures that the output of downstream random functions are not dependent on whether I parallelized an upstream function.

ADD COMMENTlink modified 3 months ago • written 3 months ago by Aaron Lun23k

Thanks for your answers. As you are also a contributer of limma package, can I ask you for the suggestion on my package. In my permutation test, each computational job will permute samples and then call lmFit and eBayes of limma. Do you think re-writting a Rcpp function which calls lmFit and eBayes within C++ code will reduce running time? Thanks!

ADD REPLYlink modified 3 months ago • written 3 months ago by luluchen0

No, it's probably not worth it. lmFit just calls lm.fit and that's already vectorized across genes. eBayes doesn't do anything that needs specialist C++ code either. If you're running lmFit with weights, perhaps there's an argument for migrating to C++ to avoid looping across genes; but even then, the code is already pretty fast so I haven't bothered.

ADD REPLYlink modified 3 months ago • written 3 months ago by Aaron Lun23k

Writing C++ code to avoid the warning not to engage in the bad practice of setting the random seed inside a function really doesn't sound like a good idea! Arrange your code so that setting the seed outside the function is sufficient to create exactly reproducible results.

ADD REPLYlink written 3 months ago by Martin Morgan ♦♦ 23k

In my case, I was using C++ anyway for other things, so it wasn't a big deal to switch from R's random number API to C++'s standard <random> library. (In fact, one could argue that it would be stylistically more natural to use <random> in C++ code in the first place.) Avoiding the BiocCheck warning was just a convenient side-effect. For sure, I would find it strange to write C++ code purely to avoid the warning, but I've never been in that situation yet.

ADD REPLYlink modified 3 months ago • written 3 months ago by Aaron Lun23k

Thanks for your comments! Now I know how to set a specific seed for each parallel job in R code or C++ code. 

ADD REPLYlink written 3 months ago by luluchen0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 285 users visited in the last hour