Streamlining the computing time for miloDE p-value correction in a large dataset?
Joseph • 0
@678dbc2b
Last seen 25 days ago
United States

Hello,

I have a large, multi-condition single-cell dataset for which I am using the miloDE package as one part of the analysis.

The object contains >270,000 cells in total, spanning 5 discrete treatment conditions (15 samples). For what it is worth, I expect the cells used here to represent one overall cell type, so it is a fairly homogeneous population.

Using k = 27 for kNN construction resulted in 477 neighborhoods with an average size of a few thousand cells per neighborhood (histogram below).

I have attempted to run differential expression with the "de_test_neighborhoods" function, running just one statistical comparison at a time, but it is taking an extremely long time to finish. It appears to get stuck at the p-value correction across Nhoods step, where it has been computing for tens of hours, even on an HPC cluster, and so far it has not finished or produced the de_stat object.

Is this to be expected for a dataset of this size, or could something be going wrong in the computation? If anyone also has advice on how to downsample to fewer cells while keeping rigor across the multiple conditions, that would be great too! Thank you!

[Histogram of neighborhood sizes]

SingleCellExperiment edgeR miloR Bioconductor
ATpoint

Is a method that is this computationally expensive really beneficial and necessary? You appear to have biological replication, so pseudobulk aggregation would be possible. Pseudobulk has been shown to outperform single-cell DE in many setups; there is literature on that. There is also literature showing that specialised single-cell methods rarely (if at all) outperform methods originally developed for bulk analysis, yet the bulk methods run orders of magnitude faster. Maybe consider changing methods for such an extensive dataset.
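For reference, a minimal sketch of the pseudobulk route, assuming a SingleCellExperiment sce with raw counts in the "counts" assay and colData columns sample_id and condition (both column names are placeholders for your own metadata):

library(scuttle)  # aggregateAcrossCells()
library(edgeR)

# One pseudobulk profile per sample: sum raw counts over cells
# ('sample_id' and 'condition' are assumed colData column names)
pb <- aggregateAcrossCells(sce, ids = colData(sce)$sample_id)

# Standard edgeR quasi-likelihood workflow on the pseudobulk counts
y <- DGEList(counts = counts(pb), samples = as.data.frame(colData(pb)))
keep_genes <- filterByExpr(y, group = y$samples$condition)
y <- y[keep_genes, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)

design <- model.matrix(~ condition, data = y$samples)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit, coef = 2)  # one treatment level vs. the reference
topTags(res)

With 15 samples across 5 conditions, this should run in seconds to minutes rather than hours.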

Kevin Blighe ★ 4.0k
@kevin
Last seen 4 hours ago
The Cave, 181 Longwood Avenue, Boston, …

Hi Joseph,

I agree with ATpoint's suggestion that pseudobulk aggregation (e.g., via aggregateAcrossCells in scuttle or aggregateData in muscat) would likely be far more efficient here, especially with your biological replicates, and it often yields comparable or superior results to specialised single-cell methods for differential expression. The literature backs this up, and bulk-style approaches like edgeR scale much better to datasets of this size.

That said, if you wish to stick with Milo, long run-times for de_test_neighborhoods are unfortunately expected with 477 neighbourhoods: each one requires its own edgeR GLM fit, followed by the p-value correction across all neighbourhoods, and this doesn't parallelise well by default. There's no obvious error in your description. One lever is the neighbourhood assignment itself: in miloDE, a larger k typically yields bigger but fewer neighbourhoods, so there are fewer per-neighbourhood fits, at the potential cost of some resolution.
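If your installed miloDE version supports BiocParallel for this step (the package tutorial passes a BPPARAM argument to the DE-testing function, but do check the argument and the exact function spelling in the help page of your version; this is a sketch, not a guaranteed interface), distributing the per-neighbourhood fits across workers may also cut the wall-clock time:

library(BiocParallel)

# Multicore backend; match 'workers' to your HPC allocation
mcparam <- MulticoreParam(workers = 8)

# Assumption: the DE-testing function accepts BPPARAM, as in the miloDE tutorial.
# Verify with ?de_test_neighborhoods before relying on this.
# de_stat <- de_test_neighborhoods(sce_milo, ..., BPPARAM = mcparam)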

For downsampling while preserving rigor across conditions: since your population is homogeneous, a simple approach is to cap the number of cells per sample so that every sample (and hence every condition) stays well represented. Here's a simple way to keep up to 10,000 cells per sample, i.e. at most ~150k cells total across your 15 samples (adjust target_n as needed):

library(SingleCellExperiment)

# Assuming 'sce' is your SingleCellExperiment, with colData(sce)$sample_id
set.seed(1)        # make the subsample reproducible
target_n <- 10000  # cells to keep per sample

samples <- unique(colData(sce)$sample_id)
keep <- vector("list", length(samples))
names(keep) <- samples

for (s in names(keep)) {
  idx <- which(colData(sce)$sample_id == s)
  if (length(idx) > target_n) {
    keep[[s]] <- sample(idx, target_n)  # random subsample within this sample
  } else {
    keep[[s]] <- idx                    # small samples are kept in full
  }
}

sce_down <- sce[, unlist(keep)]

This keeps representation balanced per sample and condition. Re-run your Milo setup on sce_down and see if it completes.
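As a quick sanity check (the condition column name below is a placeholder; substitute your own metadata column), confirm that every sample and condition is still represented after downsampling:

table(colData(sce)$sample_id)       # before
table(colData(sce_down)$sample_id)  # after
table(colData(sce_down)$condition)  # 'condition' is a placeholder column name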

Best,
Kevin
