how to achieve reproducibility with BiocParallel regardless of number of threads and OS (set.seed is disallowed)
1
0
Entering edit mode
luluchen • 0
@luluchen-15365
Last seen 5.3 years ago

Dear All,

My package will use BiocParallel for permutation test.  To make permutations reproducible, I read some discussions online (https://stat.ethz.ch/pipermail/bioc-devel/2015-June/007583.htmlhttps://stackoverflow.com/questions/30610375/how-to-run-permutations-using-mclapply-in-a-reproducible-way-regardless-of-numbe). There are two easy solutions:

(1) Generate the permuted sample indexes before parallel computations. In this way, set.seed() can be used outside of my R function.

(2) Generate a vector of seeds (provided by users through argument of my R function) and use them by set.seed() in each parallel computation.

My permutation scheme is a little complex, so storing sample indexes needs a huge storage space. I have to use the second solution. However, set.seed() will cause warning in BiocCheck(). Could anybody give me suggestions?

Thanks a lot for your help!

Best,

Lulu

BiocParallel • 1.6k views
ADD COMMENT
0
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 1 hour ago
The city by the bay

This question is more appropriate for the Bioc-devel mailing list. Please direct future correspondence there.

In the meantime, here's some food for thought. I often run into this problem in several of my packages containing with heavy-duty randomized algorithms, e.g., correlation functions in scran, emptyDrops in DropletUtils. Fortunately for me, all of the functions that do enough computationally intensive work to warrant parallelization have also been re-written in C++. Thus, in the serial phase of each function, I use R's PRNG to generate random integers, one per computational job (e.g., per sample or per iteration). In each job, the sampled integer is used to seed std::mt19937 in each worker for random number generation at the C++ level. No matter how the jobs are split up between workers, I will always obtain the same series of random numbers. This ensures that I obtain the same results regardless of the BiocParallelParam object used. It also leaves R's .Rand.seed in the same state regardless of the parallelization scheme, which ensures that the output of downstream random functions are not dependent on whether I parallelized an upstream function.

ADD COMMENT
0
Entering edit mode

Thanks for your answers. As you are also a contributer of limma package, can I ask you for the suggestion on my package. In my permutation test, each computational job will permute samples and then call lmFit and eBayes of limma. Do you think re-writting a Rcpp function which calls lmFit and eBayes within C++ code will reduce running time? Thanks!

ADD REPLY
0
Entering edit mode

No, it's probably not worth it. lmFit just calls lm.fit and that's already vectorized across genes. eBayes doesn't do anything that needs specialist C++ code either. If you're running lmFit with weights, perhaps there's an argument for migrating to C++ to avoid looping across genes; but even then, the code is already pretty fast so I haven't bothered.

ADD REPLY
0
Entering edit mode

Writing C++ code to avoid the warning not to engage in the bad practice of setting the random seed inside a function really doesn't sound like a good idea! Arrange your code so that setting the seed outside the function is sufficient to create exactly reproducible results.

ADD REPLY
0
Entering edit mode

In my case, I was using C++ anyway for other things, so it wasn't a big deal to switch from R's random number API to C++'s standard <random> library. (In fact, one could argue that it would be stylistically more natural to use <random> in C++ code in the first place.) Avoiding the BiocCheck warning was just a convenient side-effect. For sure, I would find it strange to write C++ code purely to avoid the warning, but I've never been in that situation yet.

ADD REPLY
0
Entering edit mode

Thanks for your comments! Now I know how to set a specific seed for each parallel job in R code or C++ code. 

ADD REPLY

Login before adding your answer.

Traffic: 512 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6