Best practice for handling large data (matrix with >2^31-1 non-zero elements) in SingleCellExperiment?
1
1
Entering edit mode
Fabian ▴ 10
@f11c7f6b
Last seen 3 months ago
Switzerland

Hello everyone,

I'm facing a hard limit when building large single-cell atlases in R by merging multiple SingleCellExperiment objects. The process fails when the combined object would exceed 2^31 - 1 non-zero elements in its count matrix. This is a common issue when working with over ~1 million cells. I understand this limitation comes from the dgCMatrix class in the Matrix package, which uses 32-bit integers for indexing. While switching to Python/Scanpy is an option, I would prefer to find a solution within the R/Bioconductor ecosystem. What are the community-recommended best practices for circumventing this?

Below is a minimal reproducible example that demonstrates the error.

Thanks for your help!

Reproducible Example:

library(SingleCellExperiment)
library(Matrix)

# This helper function creates one large SCE object.
# It generates unique indices to avoid the issue of `sparseMatrix` summing up
# values from duplicate indices, thus ensuring the number of stored non-zero
# elements is exactly what is specified.
create_sce <- function(elements, rows = 60000) {
  cols <- ceiling(elements / rows)
  unique_pos <- sample.int(as.double(rows) * cols, elements)

  mat <- sparseMatrix(
    i = ((unique_pos - 1) %% rows) + 1,
    j = ((unique_pos - 1) %/% rows) + 1,
    x = 1,
    dims = c(rows, cols)
  )
  SingleCellExperiment(assays = list(counts = mat))
}

# Define the number of non-zero elements for each object
elements_per_sce <- .Machine$integer.max / 2 + 1e5

# Create the two objects. Caution: This is memory-intensive.
sce1 <- create_sce(elements_per_sce)
sce2 <- create_sce(elements_per_sce)

# This command will now reliably fail with the expected error
try(cbind(sce1, sce2))

Error Message:

Error in cbind.Matrix(x, y, deparse.level = 0L) : 
  p[length(p)] cannot exceed 2^31-1

Session Info:

> sessionInfo()
Failed to query server: Transport endpoint is not connected
R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.8 (Ootpa)

Matrix products: default
BLAS/LAPACK: /home/user/workflow/.snakemake/conda/ebc4120a694f5d3b5f45cc199e08193a_/lib/libopenblasp-r0.3.29.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Zurich
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] Matrix_1.6-5                SingleCellExperiment_1.24.0
 [3] SummarizedExperiment_1.32.0 Biobase_2.62.0             
 [5] GenomicRanges_1.54.1        GenomeInfoDb_1.38.1        
 [7] IRanges_2.36.0              S4Vectors_0.40.2           
 [9] BiocGenerics_0.48.1         MatrixGenerics_1.14.0      
[11] matrixStats_1.5.0          

loaded via a namespace (and not attached):
 [1] SparseArray_1.2.2       zlibbioc_1.48.0         lattice_0.22-7         
 [4] abind_1.4-5             GenomeInfoDbData_1.2.11 S4Arrays_1.2.0         
 [7] XVector_0.42.0          RCurl_1.98-1.16         bitops_1.0-9           
[10] grid_4.3.3              DelayedArray_0.28.0     compiler_4.3.3         
[13] tools_4.3.3             crayon_1.5.3           
Warning message:
In system("timedatectl", intern = TRUE) :
  running command 'timedatectl' had status 1
SingleCellExperiment • 4.0k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 29k
@alun
Last seen 14 minutes ago
The city by the bay

Consider using a SVT_SparseMatrix from the SparseArray package, which is explicitly designed for this. DropletUtils (in BioC-devel) can read directly into a SVT_SparseMatrix from the Matrix Market format, and I believe you can also coerce a HDF5-backed matrix into one with the usual as(..., "SVT_SparseMatrix").

ADD COMMENT
0
Entering edit mode

Was anyone else able to put the pieces together using the reply above? I read in one 10X sample using DropletUtils::read10xCounts() and this produced a SingleCellExperiment object. The counts are stored in a DelayedMatrix and I'm not clear how to tell DropletUtils or DelayedArray to store the counts as an SVT_SparseMatrix. I'm not sure whether I need to perform surgery on the SingleCellExperiment object by replacing the DelayedMatrix counts with an SVT_SparseMatrix or if there is an easier way to do this. I eventually need to read in and merge 150 samples which ~1.5 million cells altogether. As with the original poster, I'd like to stay in the Bioconductor ecosystem.

ADD REPLY
0
Entering edit mode

Did you use the development version as Aaron suggested? There is an argument to read the market matrix file directly into an SVT_SparseMatrix object.

read10xCounts <- function(samples, 
    sample.names=names(samples), 
    col.names=FALSE, 
    row.names = c("id", "symbol"),
    type=c("auto", "mtx", "hdf5", "prefix", "sparse", "HDF5"), 
    delayed=FALSE,
    version=c("auto", "2", "3"), 
    genome=NULL, 
    compressed=NULL, 
    intersect.genes=FALSE,
    mtx.two.pass=FALSE,
    mtx.class=c("CsparseMatrix", "SVT_SparseMatrix"),
    mtx.threads=1,
    BPPARAM=SerialParam())
{

Login before adding your answer.

Traffic: 646 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6