I'm using the batchelor::fastMNN function to integrate cells across batches as part of a book that I'm writing. I set up a Docker container for reproducibility purposes. While the results of the fastMNN function are reproducible locally inside the container, it gives different results when run on GitHub Actions. I wrote tests to check the reproducibility of the results, and it seems that some of the PCs are sign-flipped on GitHub Actions compared to the local results. I suspect similar issues arise when using runUMAP, and I saw online that problems like this can originate from irlba.
Is this to be expected when running on different machines, even though the computations are performed within the same Docker container?
library(batchelor)
set.seed(220228)
out <- fastMNN(spe, batch = spe$patient_id,
               auto.merge = TRUE,
               subset.row = rowData(spe)$use_channel,
               assay.type = "exprs")
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] testthat_3.1.9 SeuratObject_4.1.3
## [3] Seurat_4.3.0.1 BiocSingular_1.16.0
## [5] harmony_0.1.1 Rcpp_1.0.10
## [7] viridis_0.6.3 viridisLite_0.4.2
## [9] dittoSeq_1.12.0 cowplot_1.1.1
## [11] scater_1.28.0 ggplot2_3.4.2
## [13] scuttle_1.10.1 SpatialExperiment_1.10.0
## [15] batchelor_1.16.0 SingleCellExperiment_1.22.0
## [17] SummarizedExperiment_1.30.2 Biobase_2.60.0
## [19] GenomicRanges_1.52.0 GenomeInfoDb_1.36.1
## [21] IRanges_2.34.1 S4Vectors_0.38.1
## [23] BiocGenerics_0.46.0 MatrixGenerics_1.12.2
## [25] matrixStats_1.0.0
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.20 splines_4.3.1
## [3] later_1.3.1 bitops_1.0-7
## [5] tibble_3.2.1 R.oo_1.25.0
## [7] polyclip_1.10-4 lifecycle_1.0.3
## [9] rprojroot_2.0.3 edgeR_3.42.4
## [11] globals_0.16.2 lattice_0.21-8
## [13] MASS_7.3-60 magrittr_2.0.3
## [15] limma_3.56.2 plotly_4.10.2
## [17] sass_0.4.6 rmarkdown_2.22
## [19] jquerylib_0.1.4 yaml_2.3.7
## [21] httpuv_1.6.11 sctransform_0.3.5
## [23] spatstat.sparse_3.0-1 sp_2.0-0
## [25] reticulate_1.30 pbapply_1.7-0
## [27] RColorBrewer_1.1-3 ResidualMatrix_1.10.0
## [29] pkgload_1.3.2 abind_1.4-5
## [31] zlibbioc_1.46.0 Rtsne_0.16
## [33] purrr_1.0.1 R.utils_2.12.2
## [35] RCurl_1.98-1.12 GenomeInfoDbData_1.2.10
## [37] ggrepel_0.9.3 irlba_2.3.5.1
## [39] spatstat.utils_3.0-3 listenv_0.9.0
## [41] pheatmap_1.0.12 goftest_1.2-3
## [43] spatstat.random_3.1-5 dqrng_0.3.0
## [45] fitdistrplus_1.1-11 parallelly_1.36.0
## [47] DelayedMatrixStats_1.22.1 leiden_0.4.3
## [49] codetools_0.2-19 DropletUtils_1.20.0
## [51] DelayedArray_0.26.3 tidyselect_1.2.0
## [53] farver_2.1.1 ScaledMatrix_1.8.1
## [55] spatstat.explore_3.2-1 jsonlite_1.8.5
## [57] BiocNeighbors_1.18.0 ellipsis_0.3.2
## [59] progressr_0.13.0 ggridges_0.5.4
## [61] survival_3.5-5 tools_4.3.1
## [63] ica_1.0-3 glue_1.6.2
## [65] gridExtra_2.3 xfun_0.39
## [67] dplyr_1.1.2 HDF5Array_1.28.1
## [69] withr_2.5.0 fastmap_1.1.1
## [71] rhdf5filters_1.12.1 fansi_1.0.4
## [73] digest_0.6.31 rsvd_1.0.5
## [75] R6_2.5.1 mime_0.12
## [77] colorspace_2.1-0 scattermore_1.2
## [79] tensor_1.5 spatstat.data_3.0-1
## [81] R.methodsS3_1.8.2 utf8_1.2.3
## [83] tidyr_1.3.0 generics_0.1.3
## [85] data.table_1.14.8 httr_1.4.6
## [87] htmlwidgets_1.6.2 S4Arrays_1.0.4
## [89] uwot_0.1.14 pkgconfig_2.0.3
## [91] gtable_0.3.3 lmtest_0.9-40
## [93] XVector_0.40.0 brio_1.1.3
## [95] htmltools_0.5.5 bookdown_0.34
## [97] scales_1.2.1 png_0.1-8
## [99] knitr_1.43 rstudioapi_0.14
## [101] reshape2_1.4.4 rjson_0.2.21
## [103] nlme_3.1-162 cachem_1.0.8
## [105] zoo_1.8-12 rhdf5_2.44.0
## [107] stringr_1.5.0 KernSmooth_2.23-21
## [109] parallel_4.3.1 miniUI_0.1.1.1
## [111] vipor_0.4.5 desc_1.4.2
## [113] pillar_1.9.0 grid_4.3.1
## [115] vctrs_0.6.3 RANN_2.6.1
## [117] promises_1.2.0.1 beachmat_2.16.0
## [119] xtable_1.8-4 cluster_2.1.4
## [121] waldo_0.5.1 beeswarm_0.4.0
## [123] evaluate_0.21 magick_2.7.4
## [125] cli_3.6.1 locfit_1.5-9.8
## [127] compiler_4.3.1 rlang_1.1.1
## [129] crayon_1.5.2 future.apply_1.11.0
## [131] labeling_0.4.2 plyr_1.8.8
## [133] ggbeeswarm_0.7.2 stringi_1.7.12
## [135] deldir_1.0-9 BiocParallel_1.34.2
## [137] munsell_0.5.0 lazyeval_0.2.2
## [139] spatstat.geom_3.2-1 Matrix_1.5-4.1
## [141] patchwork_1.1.2 sparseMatrixStats_1.12.1
## [143] future_1.32.0 Rhdf5lib_1.22.0
## [145] shiny_1.7.4 highr_0.10
## [147] ROCR_1.0-11 igraph_1.5.0
## [149] bslib_0.5.0
Is the local one a Mac? Not answering your question specifically, but I previously found that, despite full containerization, I got different results from standalone sctransform on a local Mac compared to an HPC and a workstation. Very similar to: https://github.com/satijalab/sctransform/issues/68 -- I never found out why, or at least I don't recall.
Thanks for your input! Yes, the local one is a Mac. That actually reminds me that I could switch the OS on GitHub Actions to macOS. Let's see if that helps.
You seem to be trying to get (numerical) reproducibility across platforms, which can be pretty hard (and can sometimes reveal issues with the code). Specifically, you're running into problems with a (fast) PCA implemented in irlba. PCs are only defined up to sign, so I don't think you can expect that different platforms (and possibly different BLAS libraries) are guaranteed to produce the same signs.
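To illustrate, a test comparing two PC matrices can be made sign-invariant by aligning column orientations before checking equality. This is just a minimal sketch (the function name and the simulated matrices are made up for the example):

```r
## Each PC column is only defined up to sign, so flip columns of `test`
## to match the orientation of `ref` before checking near-equality.
align_signs <- function(ref, test) {
  stopifnot(dim(ref) == dim(test))
  for (j in seq_len(ncol(test))) {
    if (sum(ref[, j] * test[, j]) < 0) {  # opposite orientation
      test[, j] <- -test[, j]
    }
  }
  test
}

set.seed(42)
pcs1 <- matrix(rnorm(20), ncol = 2)
pcs2 <- pcs1
pcs2[, 2] <- -pcs2[, 2]                   # simulate a flipped PC
all.equal(pcs1, align_signs(pcs1, pcs2))  # TRUE after alignment
```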
Thanks for the reply! That's what I thought -- I just wanted to get an expert opinion on this issue. I'll still suggest that users use the (default) fast PCA implementation, and I will skip all tests that check irlba results on CI.
Respectfully, I would work harder on making the test reproducible. Disabling the test should in my opinion only be done as a last resort.
For the sign of a PCA on a well-defined dataset, you may sometimes have a covariate: you can test cor(PC, covariate) and, if the correlation is negative, flip the sign of the PC.
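A small sketch of that idea (the covariate here is hypothetical, just to demonstrate the mechanism):

```r
## Make the orientation of a PC deterministic by flipping it whenever
## it correlates negatively with a known covariate.
canonicalize_pc <- function(pc, covariate) {
  if (cor(pc, covariate) < 0) -pc else pc
}

set.seed(1)
covariate <- seq_len(100)                          # hypothetical covariate
pc <- -(covariate - mean(covariate)) + rnorm(100)  # a sign-flipped PC
fixed <- canonicalize_pc(pc, covariate)
cor(fixed, covariate) > 0  # TRUE: orientation is now deterministic
```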
Will do, thanks. Testing the sign of the PCA might get a bit tricky, as this is the output of the fastMNN function and not of a plain PCA. My next step will be to get Docker running on macOS on GitHub Actions and check whether the results are reproducible there.