Question

Scalability of emptyDrops() running times

0

Entering edit mode

rosano • 0

@rosano-20689

Last seen 4.7 years ago

Hi all,

Currently I am using emptyDrops() for calling cells after applying swappedDrops() to perform barcode swapping removal of several 10X scRNA datasets. After experiencing long running times for emptyDrops() on these data, I was wondering about the scalability of the function's performance.

As mentioned in [1], emptyDrops() requires approximately 1-2 minutes to run on each of the tested datasets, and this was confirmed with example dataset "placenta1" (dimension of 33,694 features x 737,280 barcodes), which took around 75 seconds to complete. However, when using emptyDrops() on my datasets of interest, this is taking much longer than expected, eg. for a dataset of dimension 33,538 features x 737,280 barcodes (ie. a total of 156 fewer genes), the running time is around 245 seconds. When attempting to clarify this difference, I also looked at the degree of "sparsity" of each dataset, and while "placenta1" has 18,763,564 non-zero elements, my example dataset had 10,793,183. Do any of this factors (dimensions, sparsity, etc.) or others influence the running time of emptyDrops()? Is it possible to reduce it somehow? I apply this function several times over several datasets, therefore my interest in the matter.

Thank you in advance!

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6431044/

dropletutils emptydrops • 1.1k views

ADD COMMENT • link updated 5.2 years ago by Aaron Lun ★ 28k • written 5.2 years ago by rosano • 0

score 0 · Answer 1 · 2019-05-17

The time complexity of emptyDrops() is something like:

[Niters * log(Ngenes) * max(Ntotals)] * C1 + [Niters * log(Ncells)] * C2

That's probably not completely correct, but it's close enough. The max(Ntotals) is the largest total count across all cell barcodes; the other terms should be pretty obvious. The weirdness of the expression is because of some tricks it uses to avoid computing Monte Carlo p-values for each cell separately; such a naive approach would give you something closer to Niters * Ngenes * Ncells.

My best guess is that the maximum total count differs between your data sets. Beyond that, there are all sorts of very specific issues like cache locality and the pattern of values in the binary search, which make it difficult to predict the cause of differences in speed. For example... are the highest-expressing genes in adjacent rows of the matrix? Are the highest-expressing genes occupying the middle rows of the matrix?

If it's not fast enough for you, just throw it onto more cores with BPPARAM. Given a seed, you should get exactly the same results regardless of the number of cores that you use.

P.S. I pushed the locality branch to the DropletUtils GitHub repository, which solves a few of the locality issues. It gives me a modest (15%) speed-up on the PBMC 4K data set. You can try it out and see if it's any faster on your data set (just run BiocManager::install("MarioniLab/DropletUtils@locality") - but don't use it for anything other than timing, at least for the time being).

P.P.S. locality has been merged, so BiocManager::install("MarioniLab/DropletUtils") should get you the development version of the package that contains this modification.