I'm using a fairly standard scRNA-seq data analysis workflow, inspired by the OSCA pipeline (i.e., using
scran among others).
As the dataset is fairly big (>100'000 cells), I'm using
BiocParalell to speed up computation.
Now, the problem I encounter is that every time I rerun my code (typically as part of knitting an .Rmd file), I get slightly different results, more specifically:
- The UMAP projection looks different
- The number of clusters and the cell-to-cluster assignments are different.
This is despite setting the seed at the very beginning (
bp is what I pass to all the functions that have a
set.seed(16) bp <- BiocParallel::MulticoreParam(workers = 8, RNGseed = 16)
The differences between runs are not major, but, as an example, they prevent me to reliably map cluster numbers to cell types after running
SingleR (because, say, cluster 7 might indicate completely unrelated cells in different runs).
I'm guessing this might be due to some stochastic component of the UMAP and Louvain clustering algorithms, though I would have thought setting the seed was enough. Interestingly, I can't quite reproduce this on a small, toy dataset, possibly because the algorithms converge more easily and/or in less time.
How can I ensure reproducibility of dimensionality reduction and clustering in
SingleCellExperiment workflows using
The lack of reproducibility has been true in the past, but I believe under the (just released) BiocParallel 1.28.0 setting
RNGseed =will make the results reproducible, including across workers and 'back-ends'. (Unless the author of a package has subverted this, perhaps as a legitimate attempt to 'correct' the misbehavior of previous BiocParallel). A new vignette describes random number behavior in detail.
Thank you both Martin Morgan and Aaron Lun! I can confirm upgrading to Bioconductor 3.14 and BiocParallel 1.28.0 now gives me reproducible results across runs. Hooray!