how to achieve reproducibility with BiocParallel regardless of number of threads and OS (set.seed is disallowed)

0

Entering edit mode

luluchen • 0

@luluchen-15365

Last seen 6.2 years ago

Dear All,

My package will use BiocParallel for permutation test. To make permutations reproducible, I read some discussions online (https://stat.ethz.ch/pipermail/bioc-devel/2015-June/007583.html, https://stackoverflow.com/questions/30610375/how-to-run-permutations-using-mclapply-in-a-reproducible-way-regardless-of-numbe). There are two easy solutions:

(1) Generate the permuted sample indexes before parallel computations. In this way, set.seed() can be used outside of my R function.

(2) Generate a vector of seeds (provided by users through argument of my R function) and use them by set.seed() in each parallel computation.

My permutation scheme is a little complex, so storing sample indexes needs a huge storage space. I have to use the second solution. However, set.seed() will cause warning in BiocCheck(). Could anybody give me suggestions?

Thanks a lot for your help!

Best,

Lulu

BiocParallel • 2.1k views

ADD COMMENT • link updated 6.2 years ago by Aaron Lun ★ 28k • written 6.2 years ago by luluchen • 0

0

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 5 hours ago

The city by the bay

This question is more appropriate for the Bioc-devel mailing list. Please direct future correspondence there.

In the meantime, here's some food for thought. I often run into this problem in several of my packages containing with heavy-duty randomized algorithms, e.g., correlation functions in scran, emptyDrops in DropletUtils. Fortunately for me, all of the functions that do enough computationally intensive work to warrant parallelization have also been re-written in C++. Thus, in the serial phase of each function, I use R's PRNG to generate random integers, one per computational job (e.g., per sample or per iteration). In each job, the sampled integer is used to seed std::mt19937 in each worker for random number generation at the C++ level. No matter how the jobs are split up between workers, I will always obtain the same series of random numbers. This ensures that I obtain the same results regardless of the BiocParallelParam object used. It also leaves R's .Rand.seed in the same state regardless of the parallelization scheme, which ensures that the output of downstream random functions are not dependent on whether I parallelized an upstream function.

ADD COMMENT • link 6.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks for your answers. As you are also a contributer of limma package, can I ask you for the suggestion on my package. In my permutation test, each computational job will permute samples and then call lmFit and eBayes of limma. Do you think re-writting a Rcpp function which calls lmFit and eBayes within C++ code will reduce running time? Thanks!

ADD REPLY • link 6.2 years ago luluchen • 0

0

Entering edit mode

No, it's probably not worth it. lmFit just calls lm.fit and that's already vectorized across genes. eBayes doesn't do anything that needs specialist C++ code either. If you're running lmFit with weights, perhaps there's an argument for migrating to C++ to avoid looping across genes; but even then, the code is already pretty fast so I haven't bothered.

ADD REPLY • link 6.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Writing C++ code to avoid the warning not to engage in the bad practice of setting the random seed inside a function really doesn't sound like a good idea! Arrange your code so that setting the seed outside the function is sufficient to create exactly reproducible results.

ADD REPLY • link 6.2 years ago Martin Morgan 25k

0

Entering edit mode

In my case, I was using C++ anyway for other things, so it wasn't a big deal to switch from R's random number API to C++'s standard <random> library. (In fact, one could argue that it would be stylistically more natural to use <random> in C++ code in the first place.) Avoiding the BiocCheck warning was just a convenient side-effect. For sure, I would find it strange to write C++ code purely to avoid the warning, but I've never been in that situation yet.

ADD REPLY • link 6.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks for your comments! Now I know how to set a specific seed for each parallel job in R code or C++ code.

ADD REPLY • link 6.2 years ago luluchen • 0

Login before adding your answer.