Hi all, so I have assembled a rather large single-cell dataset that I am working on analyzing right now. The dataset consists of 15 samples spread across 5 different treatment conditions (N = 3 samples/condition). After QC etc., I have a whopping ~450,000 cells to play with. The ultimate goal is to identify small cell subtypes and perform pseudobulking to determine gene expression differences between the treatments.
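(For context, the pseudobulking step I have in mind is roughly the sketch below, summing counts per sample with scuttle::aggregateAcrossCells(); the Sample column is from my colData, and I haven't actually gotten to this step yet.)

library(scuttle)
# Sum raw counts within each sample to get per-sample pseudobulk profiles (sketch, not run yet)
pseudo_sce <- aggregateAcrossCells(SCE_all, ids = SCE_all$Sample)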
I have opted to use BioC for the analysis, and my SingleCellExperiment object is ~23 GB in memory, including the dimensionality reduction etc.
I am currently attempting to perform kNN graph-based clustering, but when I run the clusterCells() function it consumes an absurd amount of memory (I attempted to run it on an HPC cluster and got kicked off when the job used >450 GB of RAM...).
As someone who is relatively new to this kind of analysis, is it expected that this much memory would be used during steps like clustering or pseudobulking? If not, are there any hints as to why the function might be causing this issue? Here is the code:
library(scran); library(bluster)  # clusterCells() from scran, NNGraphParam() from bluster
clusterCells(SCE_all, use.dimred = "PCA", BLUSPARAM = NNGraphParam(k = 30, cluster.fun = "louvain"))
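For what it's worth, one variant I was considering trying (untested, and I'm assuming NNGraphParam() will accept a BiocNeighbors BNPARAM here) is approximate Annoy-based neighbour search, in case the exact kNN step is part of the problem:

library(BiocNeighbors)
# Same clustering call, but with approximate (Annoy) neighbour detection;
# just a guess on my part that this could reduce the memory footprint.
clusterCells(SCE_all, use.dimred = "PCA",
             BLUSPARAM = NNGraphParam(k = 30, cluster.fun = "louvain", BNPARAM = AnnoyParam()))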
And here is the info for my SCE:
class: SingleCellExperiment
dim: 30711 470269
metadata(0):
assays(2): counts logcounts
rownames(30711): Tbx2 Tbx4 ... Unknown Unknown
rowData names(0):
colnames: NULL
colData names(11): Sample Barcode ... TreatmentGroup sizeFactor
reducedDimNames(1): PCA
mainExpName: NULL
altExpNames(0):
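In case it helps, this is roughly how I've been checking the object size and the PCA dimensions (just utils::object.size() and SingleCellExperiment::reducedDim(), nothing exotic):

format(object.size(SCE_all), units = "GB")   # this is where the ~23 GB figure comes from
dim(reducedDim(SCE_all, "PCA"))              # number of cells x number of PCs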
Happy to provide any other code; I just figured asking the question more generally first would be helpful. Thanks :)