Ensuring reproducibility in BiocParallel SingleCellExperiment workflows
enricoferrero ▴ 580
@enricoferrero-6037
Last seen 9 days ago
Switzerland

I'm using a fairly standard scRNA-seq data analysis workflow, inspired by the OSCA pipeline (i.e., using scuttle, scater, scran among others). As the dataset is fairly big (>100'000 cells), I'm using BiocParallel to speed up computation.

Now, the problem I encounter is that every time I rerun my code (typically as part of knitting an .Rmd file), I get slightly different results, more specifically:

  • The UMAP projection looks different.
  • The number of clusters and the cell-to-cluster assignments are different.

This is despite setting the seed at the very beginning (bp is what I pass to all the functions that have a BPPARAM argument):

set.seed(16)
bp <- BiocParallel::MulticoreParam(workers = 8, RNGseed = 16)

The differences between runs are not major but, as an example, they prevent me from reliably mapping cluster numbers to cell types after running SingleR (because, say, cluster 7 might indicate completely unrelated cells in different runs).

I'm guessing this might be due to some stochastic component of the UMAP and Louvain clustering algorithms, though I would have thought setting the seed was enough. Interestingly, I can't quite reproduce this on a small, toy dataset, possibly because the algorithms converge more easily and/or in less time.

How can I ensure reproducibility of dimensionality reduction and clustering in SingleCellExperiment workflows using BiocParallel? Thank you.

scater SingleCellExperiment BiocParallel scuttle scran • 201 views
Aaron Lun ★ 27k
@alun
Last seen 6 hours ago
The city by the bay

I don't think that setting RNGseed inside MulticoreParam does what anyone really expects. See the discussion in https://github.com/Bioconductor/BiocParallel/pull/140.

More generally, it would help if you could identify the offending function. I would guess that it is the PCA if it affects both the UMAP and the clustering.
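
One way to narrow this down is to run each step twice from the same seed and compare the outputs. The following is only a minimal sketch: `same_across_runs()`, `stable_step()` and `unstable_step()` are hypothetical names invented for illustration, and the two toy steps stand in for real pipeline calls (for example a PCA, UMAP or clustering call that takes the `bp` object from the question).

```r
# Hypothetical helper (not part of any package): run `step` twice from the
# same seed and report whether the two outputs are identical.
same_across_runs <- function(step, seed = 16) {
    set.seed(seed)
    first <- step()
    set.seed(seed)
    second <- step()
    identical(first, second)
}

# Toy stand-in for a step that fully respects the seed.
stable_step <- function() {
    kmeans(matrix(rnorm(200), ncol = 2), centers = 3)$cluster
}

# Toy stand-in for a step that depends on state the seed does not control.
calls <- 0
unstable_step <- function() {
    calls <<- calls + 1
    rnorm(5) + calls
}

stable_result <- same_across_runs(stable_step)
unstable_result <- same_across_runs(unstable_step)
```

Here `stable_result` is `TRUE` and `unstable_result` is `FALSE`. In a real workflow, each `step` would wrap one pipeline call and return the piece to compare (say, the PCA coordinates), and the first step that returns `FALSE` is the one to investigate. `identical()` is deliberately strict; `all.equal()` may be a better choice if tiny floating-point differences are acceptable.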


The lack of reproducibility has been real in the past, but I believe that under the (just released) BiocParallel 1.28.0, setting RNGseed = will make the results reproducible, including across workers and 'back-ends' (unless the author of a package has subverted this, perhaps as a legitimate attempt to 'correct' the misbehavior of previous BiocParallel versions). A new vignette describes random number behavior in detail.
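
A quick way to check this on a given installation is sketched below. It assumes BiocParallel 1.28.0 or later; `draw()` is a hypothetical toy function, and `SnowParam` is used instead of the `MulticoreParam` from the question only so that the sketch also runs on Windows.

```r
library(BiocParallel)

# Assumption: the behavior described above requires BiocParallel >= 1.28.0.
stopifnot(packageVersion("BiocParallel") >= "1.28.0")

# Hypothetical toy task: one random draw per element.
draw <- function(i) rnorm(1)

# Same seed, different back-ends and worker counts.
serial <- bplapply(1:4, draw, BPPARAM = SerialParam(RNGseed = 16))
snow <- bplapply(1:4, draw, BPPARAM = SnowParam(workers = 2, RNGseed = 16))

# Same seed, same back-end, fresh parameter object.
rerun <- bplapply(1:4, draw, BPPARAM = SnowParam(workers = 2, RNGseed = 16))

same_backend <- identical(snow, rerun)
across_backends <- identical(serial, snow)
```

If the seed is honored as described, both `same_backend` and `across_backends` should be `TRUE`. If either is `FALSE`, the BiocParallel version is the first thing to check.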


Thank you both Martin Morgan and Aaron Lun! I can confirm upgrading to Bioconductor 3.14 and BiocParallel 1.28.0 now gives me reproducible results across runs. Hooray!
