Ensuring reproducibility in BiocParallel SingleCellExperiment workflows
enricoferrero ▴ 660
@enricoferrero-6037
Switzerland

I'm using a fairly standard scRNA-seq data analysis workflow, inspired by the OSCA pipeline (i.e., using scuttle, scater and scran, among others). As the dataset is fairly big (>100,000 cells), I'm using BiocParallel to speed up computation.

The problem is that every time I rerun my code (typically as part of knitting an .Rmd file), I get slightly different results. More specifically:

• The UMAP projection looks different
• The number of clusters and the cell-to-cluster assignments are different.

This is despite setting the seed at the very beginning (bp is what I pass to all the functions that have a BPPARAM argument):

set.seed(16)
bp <- BiocParallel::MulticoreParam(workers = 8, RNGseed = 16)


The differences between runs are not major but, as an example, they prevent me from reliably mapping cluster numbers to cell types after running SingleR (because, say, cluster 7 might correspond to completely unrelated cells in different runs).

I'm guessing this might be due to some stochastic component of the UMAP and Louvain clustering algorithms, though I would have thought setting the seed was enough. Interestingly, I can't quite reproduce this on a small, toy dataset, possibly because the algorithms converge more easily and/or in less time.

How can I ensure reproducibility of dimensionality reduction and clustering in SingleCellExperiment workflows using BiocParallel? Thank you.

Aaron Lun ★ 28k
@alun
The city by the bay

I don't think that setting RNGseed inside MulticoreParam does what anyone really expects. See the discussion in https://github.com/Bioconductor/BiocParallel/pull/140.

More generally, it would help if you could identify the offending function. I would guess that it is the PCA if it affects both the UMAP and the clustering.
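One way to narrow this down is to run each step twice from the same seed and compare the outputs. The sketch below uses only base R; `is_reproducible()` and the two toy steps are hypothetical names introduced for illustration, not part of any Bioconductor package. In a real workflow, each step would be a closure around one call from the pipeline (for example the PCA, UMAP or clustering call) that takes the same input object and the same BPPARAM.

```r
# Hypothetical helper: run a zero-argument function twice from the same
# session seed and report whether both runs give identical output.
is_reproducible <- function(step, seed = 16) {
  set.seed(seed)
  first <- step()
  set.seed(seed)
  second <- step()
  identical(first, second)
}

# Toy step that depends only on the session seed.
seeded_step <- function() rnorm(5)

# Toy step whose output depends on hidden state rather than on the seed,
# standing in for a function that ignores set.seed().
counter <- 0
unseeded_step <- function() {
  counter <<- counter + 1
  counter
}

# Check each step in turn; the first one reporting FALSE is the place
# where the non-determinism enters.
steps <- list(seeded = seeded_step, unseeded = unseeded_step)
report <- vapply(steps, is_reproducible, logical(1))
print(report)
```

Checking the steps in pipeline order matters: once an upstream step such as the PCA differs between runs, everything computed from it will differ too.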


The lack of reproducibility has been true in the past, but I believe that under the just-released BiocParallel 1.28.0, setting RNGseed will make the results reproducible, including across workers and back-ends (unless the author of a package has subverted this, perhaps as a legitimate attempt to 'correct' the misbehavior of previous BiocParallel versions). A new vignette describes the random number behavior in detail.
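A minimal sketch of how this could be checked, assuming BiocParallel 1.28.0 or later is installed. The function `draw()` is a hypothetical stand-in for any stochastic step; the seed value 16 is taken from the question, and the use of a fresh param object per call is a choice made here so that each call starts from the same random number stream.

```r
library(BiocParallel)

# The behavior described above is attributed to BiocParallel >= 1.28.0,
# so stop early on older versions.
stopifnot(packageVersion("BiocParallel") >= "1.28.0")

# Hypothetical stand-in for a stochastic step in the workflow.
draw <- function(i) runif(1)

# Same seed, same back-end, two separate param objects.
serial_run   <- bplapply(1:4, draw, BPPARAM = SerialParam(RNGseed = 16))
serial_rerun <- bplapply(1:4, draw, BPPARAM = SerialParam(RNGseed = 16))

# Same seed, different back-end and number of workers.
snow_run <- bplapply(1:4, draw,
                     BPPARAM = SnowParam(workers = 2, RNGseed = 16))

# Both comparisons are expected to hold if RNGseed behaves as described.
print(identical(serial_run, serial_rerun))
print(identical(serial_run, snow_run))
```

Note that the seed is passed through RNGseed on the param object; the session-level set.seed() call is not what is being tested here.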


Thank you both Martin Morgan and Aaron Lun! I can confirm upgrading to Bioconductor 3.14 and BiocParallel 1.28.0 now gives me reproducible results across runs. Hooray!