Question

Best practice for handling large data (matrix with >2^31-1 non-zero elements) in SingleCellExperiment?

1

Entering edit mode

Fabian ▴ 10

@f11c7f6b

Last seen 3 months ago

Switzerland

Hello everyone,

I'm facing a hard limit when building large single-cell atlases in R by merging multiple SingleCellExperiment objects. The process fails when the combined object would exceed 2^31 - 1 non-zero elements in its count matrix. This is a common issue when working with over ~1 million cells. I understand this limitation comes from the dgCMatrix class in the Matrix package, which uses 32-bit integers for indexing. While switching to Python/Scanpy is an option, I would prefer to find a solution within the R/Bioconductor ecosystem. What are the community-recommended best practices for circumventing this?

Below is a minimal reproducible example that demonstrates the error.

Thanks for your help!

Reproducible Example:

library(SingleCellExperiment)
library(Matrix)

# This helper function creates one large SCE object.
# It generates unique indices to avoid the issue of `sparseMatrix` summing up
# values from duplicate indices, thus ensuring the number of stored non-zero
# elements is exactly what is specified.
create_sce <- function(elements, rows = 60000) {
  cols <- ceiling(elements / rows)
  unique_pos <- sample.int(as.double(rows) * cols, elements)

  mat <- sparseMatrix(
    i = ((unique_pos - 1) %% rows) + 1,
    j = ((unique_pos - 1) %/% rows) + 1,
    x = 1,
    dims = c(rows, cols)
  )
  SingleCellExperiment(assays = list(counts = mat))
}

# Define the number of non-zero elements for each object
elements_per_sce <- .Machine$integer.max / 2 + 1e5

# Create the two objects. Caution: This is memory-intensive.
sce1 <- create_sce(elements_per_sce)
sce2 <- create_sce(elements_per_sce)

# This command will now reliably fail with the expected error
try(cbind(sce1, sce2))

Error Message:

Error in cbind.Matrix(x, y, deparse.level = 0L) : 
  p[length(p)] cannot exceed 2^31-1

Session Info:

> sessionInfo()
Failed to query server: Transport endpoint is not connected
R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.8 (Ootpa)

Matrix products: default
BLAS/LAPACK: /home/user/workflow/.snakemake/conda/ebc4120a694f5d3b5f45cc199e08193a_/lib/libopenblasp-r0.3.29.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Zurich
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] Matrix_1.6-5                SingleCellExperiment_1.24.0
 [3] SummarizedExperiment_1.32.0 Biobase_2.62.0             
 [5] GenomicRanges_1.54.1        GenomeInfoDb_1.38.1        
 [7] IRanges_2.36.0              S4Vectors_0.40.2           
 [9] BiocGenerics_0.48.1         MatrixGenerics_1.14.0      
[11] matrixStats_1.5.0          

loaded via a namespace (and not attached):
 [1] SparseArray_1.2.2       zlibbioc_1.48.0         lattice_0.22-7         
 [4] abind_1.4-5             GenomeInfoDbData_1.2.11 S4Arrays_1.2.0         
 [7] XVector_0.42.0          RCurl_1.98-1.16         bitops_1.0-9           
[10] grid_4.3.3              DelayedArray_0.28.0     compiler_4.3.3         
[13] tools_4.3.3             crayon_1.5.3           
Warning message:
In system("timedatectl", intern = TRUE) :
  running command 'timedatectl' had status 1

SingleCellExperiment • 4.0k views

ADD COMMENT • link updated 1 hour ago by James W. MacDonald 68k • written 3 months ago by Fabian ▴ 10

score 2 · Answer 1 · 2025-07-09

2

Entering edit mode

Aaron Lun ★ 29k

@alun

Last seen 14 minutes ago

The city by the bay

Consider using a SVT_SparseMatrix from the SparseArray package, which is explicitly designed for this. DropletUtils (in BioC-devel) can read directly into a SVT_SparseMatrix from the Matrix Market format, and I believe you can also coerce a HDF5-backed matrix into one with the usual as(..., "SVT_SparseMatrix").

ADD COMMENT • link 3 months ago Aaron Lun ★ 29k

0

Entering edit mode

Was anyone else able to put the pieces together using the reply above? I read in one 10X sample using DropletUtils::read10xCounts() and this produced a SingleCellExperiment object. The counts are stored in a DelayedMatrix and I'm not clear how to tell DropletUtils or DelayedArray to store the counts as an SVT_SparseMatrix. I'm not sure whether I need to perform surgery on the SingleCellExperiment object by replacing the DelayedMatrix counts with an SVT_SparseMatrix or if there is an easier way to do this. I eventually need to read in and merge 150 samples which ~1.5 million cells altogether. As with the original poster, I'd like to stay in the Bioconductor ecosystem.

ADD REPLY • link 3 hours ago dan.gatti • 0

0

Entering edit mode

Did you use the development version as Aaron suggested? There is an argument to read the market matrix file directly into an SVT_SparseMatrix object.

read10xCounts <- function(samples, 
    sample.names=names(samples), 
    col.names=FALSE, 
    row.names = c("id", "symbol"),
    type=c("auto", "mtx", "hdf5", "prefix", "sparse", "HDF5"), 
    delayed=FALSE,
    version=c("auto", "2", "3"), 
    genome=NULL, 
    compressed=NULL, 
    intersect.genes=FALSE,
    mtx.two.pass=FALSE,
    mtx.class=c("CsparseMatrix", "SVT_SparseMatrix"),
    mtx.threads=1,
    BPPARAM=SerialParam())
{

ADD REPLY • link 1 hour ago James W. MacDonald 68k