Previous implementation (in-memory)

Question

Efficiently working with HDF5-backed SingleCellExperiments

1

Entering edit mode

merv ▴ 150

@mmfansler-13248

Last seen 5 months ago

MSKCC | New York, NY

I have a 50K features X 2M cells SingleCellExperiment object backed by HDF5 (via the HDF5Array::saveHDF5SummarizeExperiment). I am trying to do a summarization to pseudobulk, but am having difficulties. I would like some guidance/tips for improving the efficiency of this operation.

Previous implementation (in-memory)

With sparseMatrix-based counts (i.e., Matrix package), I usually do something like the following:

## counts from SingleCellExperiment
cts_feature_cell <- counts(sce)

## define 'cell' to 'group' mapping
M_cell_group <- t(Matrix::fac2sparse(sce$group))

## summarize to 'group'
cts_feature_group <- cts_feature_cell %*% M_cell_group

That is fast and memory efficient.

HDF5Array Counts Summarization

Trying to translate the above to HDF5-backed counts seems straightforward, but I feel I some of the details needed to optimize it are escaping me (e.g., interaction of sparsity, chunk dimensions, block size).

For now, I have been working with code like:

library(Matrix)
library(DelayedArray)
library(HDF5Array)
library(SingleCellExperiment)
library(BiocParallel)

##############
## SETTINGS ##
##############

## work with 1GB blocks
setAutoBlockSize(1e9)
## use local SSD
setHDF5DumpDir("/scratch")
## not too worried for space
setHDF5DumpCompressionLevel(1L)
## enable parallel ops
setAutoBPPARAM(MulticoreParam(24L))

##################
## DATA PROCESS ##
##################

sce <- loadHDF5SummarizedExperiment(dir, prefix)

cts_feature_cell <- counts(sce)

M_cell_group <- DelayedArray(t(fac2sparse(sce$group)))

## this does not finish after 48 hours
cts_tx_group <- cts_feature_cell %*% M_cell_group

After some simple tests on subsets, this seemed about as good as I could get it. I set it off to run and verified that all 24 cores were running at 100%. Unfortunately, this got killed after hitting a 48 hour run time limit. I am now reassessing and wondering what further could be improved.

Should I lay out the HDF differently? Currently, it uses chunkdim=c(N_FEATURE, 1e3).

Is there something inefficient about the matrix operation?

At this point, I feel like casting the counts back to sparseMatrix in chunks, then running the sparse matrix multiplication and summing will be faster. But isn't that what DelayedArray is already doing in some sense?

HDF5Array SingleCellExperiment DelayedArray • 2.4k views

ADD COMMENT • link 2.6 years ago • updated 2.5 years ago merv ▴ 150

score 3 · Accepted Answer · 2022-09-12

3

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 7 hours ago

The city by the bay

The chunk dimensions are probably worst-case for your matrix multiplication operation. IIRC the DelayedArray multiplication implementation extracts row-wise blocks from the LHS. However, each of your HDF5 chunks spans the full set of rows. Remember that chunks must be read in their entirety, which means that the extraction of each row-wise block has to load the entire matrix into memory to obtain the specified subset of rows. This would be repeated for each block, effectively causing the implementation to read in the entire matrix N times for N blocks. Some rough math on the supplied dimensions suggests that you'd have several thousand blocks, so...

Anyway, the solution is to use more rectangular block sizes, which allow for decently efficient row-wise extraction. Or to use an algorithm that is more friendly towards column-wise extraction, though it seems that DelayedArray::colsum and scuttle::summarizeAssayByGroup are all row-based. Not too hard to do it manually, though; just blockApply across the matrix and collect a running sum for each group in each block.

ADD COMMENT • link 2.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

Oh, that makes total sense! It might be several days before I get back to this, but I’ll eventually update with how things pan out.

Also, the references to existing implementations are much appreciated. Thanks for the insight!

ADD REPLY • link 2.5 years ago merv ▴ 150

0

Entering edit mode

Following back up. I reformatted to chunkdim=c(1e3L, 1e3L) and the summarization job with all else the same (24 cores, 1e9 blocks) took about 80 minutes walltime and max RAM usage of 52GB. I'm sure there's more room for tuning (e.g., cores had ~60% efficiency), but this is sufficient for now.

Thanks again for the help!

ADD REPLY • link 2.5 years ago merv ▴ 150