Ensuring reproducibility in BiocParallel SingleCellExperiment workflows
Entering edit mode
enricoferrero ▴ 580
Last seen 9 days ago

I'm using a fairly standard scRNA-seq data analysis workflow, inspired by the OSCA pipeline (i.e., using scuttle, scater, scran among others). As the dataset is fairly big (>100'000 cells), I'm using BiocParalell to speed up computation.

Now, the problem I encounter is that every time I rerun my code (typically as part of knitting an .Rmd file), I get slightly different results, more specifically:

  • The UMAP projection looks different
  • The number of clusters and the cell-to-cluster assignments are different.

This is despite setting the seed at the very beginning (bp is what I pass to all the functions that have a BPPARAM argument):

bp <- BiocParallel::MulticoreParam(workers = 8, RNGseed = 16)

The differences between runs are not major, but, as an example, they prevent me to reliably map cluster numbers to cell types after running SingleR (because, say, cluster 7 might indicate completely unrelated cells in different runs).

I'm guessing this might be due to some stochastic component of the UMAP and Louvain clustering algorithms, though I would have thought setting the seed was enough. Interestingly, I can't quite reproduce this on a small, toy dataset, possibly because the algorithms converge more easily and/or in less time.

How can I ensure reproducibility of dimensionality reduction and clustering in SingleCellExperiment workflows using BiocParallel? Thank you.

scater SingleCellExperiment BiocParallel scuttle scran • 199 views
Entering edit mode
Aaron Lun ★ 27k
Last seen 4 hours ago
The city by the bay

I don't think that setting RNGseed inside MulticoreParam does what anyone really expects. See the discussion in https://github.com/Bioconductor/BiocParallel/pull/140.

More generally, it would help if you could identify the offending function. I would guess that it is the PCA if it affects both the UMAP and the clustering.

Entering edit mode

The lack of reproducibility has been true in the past, but I believe under the (just released) BiocParallel 1.28.0 setting RNGseed = will make the results reproducible, including across workers and 'back-ends'. (Unless the author of a package has subverted this, perhaps as a legitimate attempt to 'correct' the misbehavior of previous BiocParallel). A new vignette describes random number behavior in detail.

Entering edit mode

Thank you both Martin Morgan and Aaron Lun! I can confirm upgrading to Bioconductor 3.14 and BiocParallel 1.28.0 now gives me reproducible results across runs. Hooray!


Login before adding your answer.

Traffic: 304 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6