Dear All,
My package will use BiocParallel for permutation test. To make permutations reproducible, I read some discussions online (https://stat.ethz.ch/pipermail/bioc-devel/2015-June/007583.html, https://stackoverflow.com/questions/30610375/how-to-run-permutations-using-mclapply-in-a-reproducible-way-regardless-of-numbe). There are two easy solutions:
(1) Generate the permuted sample indexes before parallel computations. In this way, set.seed() can be used outside of my R function.
(2) Generate a vector of seeds (provided by users through argument of my R function) and use them by set.seed() in each parallel computation.
My permutation scheme is a little complex, so storing sample indexes needs a huge storage space. I have to use the second solution. However, set.seed() will cause warning in BiocCheck(). Could anybody give me suggestions?
Thanks a lot for your help!
Best,
Lulu
Thanks for your answers. As you are also a contributer of limma package, can I ask you for the suggestion on my package. In my permutation test, each computational job will permute samples and then call
lmFit
andeBayes
of limma. Do you think re-writting a Rcpp function which callslmFit
andeBayes
within C++ code will reduce running time? Thanks!No, it's probably not worth it.
lmFit
just callslm.fit
and that's already vectorized across genes.eBayes
doesn't do anything that needs specialist C++ code either. If you're runninglmFit
with weights, perhaps there's an argument for migrating to C++ to avoid looping across genes; but even then, the code is already pretty fast so I haven't bothered.Writing C++ code to avoid the warning not to engage in the bad practice of setting the random seed inside a function really doesn't sound like a good idea! Arrange your code so that setting the seed outside the function is sufficient to create exactly reproducible results.
In my case, I was using C++ anyway for other things, so it wasn't a big deal to switch from R's random number API to C++'s standard
<random>
library. (In fact, one could argue that it would be stylistically more natural to use<random>
in C++ code in the first place.) Avoiding the BiocCheck warning was just a convenient side-effect. For sure, I would find it strange to write C++ code purely to avoid the warning, but I've never been in that situation yet.Thanks for your comments! Now I know how to set a specific seed for each parallel job in R code or C++ code.