Hello everyone,
I'm facing a hard limit when building large single-cell atlases in R by merging multiple SingleCellExperiment objects. The process fails when the combined object would exceed 2^31 - 1 non-zero elements in its count matrix. This is a common issue when working with over ~1 million cells. I understand this limitation comes from the dgCMatrix class in the Matrix package, which uses 32-bit integers for indexing. While switching to Python/Scanpy is an option, I would prefer to find a solution within the R/Bioconductor ecosystem. What are the community-recommended best practices for circumventing this?
Below is a minimal reproducible example that demonstrates the error.
Thanks for your help!
Reproducible Example:
library(SingleCellExperiment)
library(Matrix)
# This helper function creates one large SCE object.
# It generates unique indices to avoid the issue of `sparseMatrix` summing up
# values from duplicate indices, thus ensuring the number of stored non-zero
# elements is exactly what is specified.
create_sce <- function(elements, rows = 60000) {
cols <- ceiling(elements / rows)
unique_pos <- sample.int(as.double(rows) * cols, elements)
mat <- sparseMatrix(
i = ((unique_pos - 1) %% rows) + 1,
j = ((unique_pos - 1) %/% rows) + 1,
x = 1,
dims = c(rows, cols)
)
SingleCellExperiment(assays = list(counts = mat))
}
# Define the number of non-zero elements for each object
elements_per_sce <- .Machine$integer.max / 2 + 1e5
# Create the two objects. Caution: This is memory-intensive.
sce1 <- create_sce(elements_per_sce)
sce2 <- create_sce(elements_per_sce)
# This command will now reliably fail with the expected error
try(cbind(sce1, sce2))
Error Message:
Error in cbind.Matrix(x, y, deparse.level = 0L) :
p[length(p)] cannot exceed 2^31-1
Session Info:
> sessionInfo()
Failed to query server: Transport endpoint is not connected
R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.8 (Ootpa)
Matrix products: default
BLAS/LAPACK: /home/user/workflow/.snakemake/conda/ebc4120a694f5d3b5f45cc199e08193a_/lib/libopenblasp-r0.3.29.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Zurich
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] Matrix_1.6-5 SingleCellExperiment_1.24.0
[3] SummarizedExperiment_1.32.0 Biobase_2.62.0
[5] GenomicRanges_1.54.1 GenomeInfoDb_1.38.1
[7] IRanges_2.36.0 S4Vectors_0.40.2
[9] BiocGenerics_0.48.1 MatrixGenerics_1.14.0
[11] matrixStats_1.5.0
loaded via a namespace (and not attached):
[1] SparseArray_1.2.2 zlibbioc_1.48.0 lattice_0.22-7
[4] abind_1.4-5 GenomeInfoDbData_1.2.11 S4Arrays_1.2.0
[7] XVector_0.42.0 RCurl_1.98-1.16 bitops_1.0-9
[10] grid_4.3.3 DelayedArray_0.28.0 compiler_4.3.3
[13] tools_4.3.3 crayon_1.5.3
Warning message:
In system("timedatectl", intern = TRUE) :
running command 'timedatectl' had status 1
Was anyone else able to put the pieces together using the reply above? I read in one 10X sample using DropletUtils::read10xCounts() and this produced a SingleCellExperiment object. The counts are stored in a DelayedMatrix and I'm not clear how to tell DropletUtils or DelayedArray to store the counts as an SVT_SparseMatrix. I'm not sure whether I need to perform surgery on the SingleCellExperiment object by replacing the DelayedMatrix counts with an SVT_SparseMatrix or if there is an easier way to do this. I eventually need to read in and merge 150 samples which ~1.5 million cells altogether. As with the original poster, I'd like to stay in the Bioconductor ecosystem.
Did you use the development version as Aaron suggested? There is an argument to read the market matrix file directly into an
SVT_SparseMatrix
object.