emptyDrops identified less non-empty cells with lower total UMI counts?
1
0
Entering edit mode
C T ▴ 140
@c-t-5858
Last seen 18 months ago
United States

Hello,

I hope someone can help me understand this better. I am running the emptyDrops method from DropletUtils library which distinguishes empty from non-empty cells. The function accepts lower parameter which specify the lower bound on the total UMI count, at or below which all barcodes are assumed to correspond to empty droplets. Link to manual.

Based on this, I thought that if I specified small number for lower, it will detect more non-empty cells and if used bigger number for lower, it will detect less non-empty cells. However, it didn't. In fact it is consistently detect more non-empty cells when lower is set to larger number. Am I missing something? Here is the code to reproduce the results.

library(DropletsUtils)
library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",
 "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))
untar(raw.path, exdir=file.path(tempdir(), "pbmc4k"))
library(DropletUtils)
fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")
sce <- read10xCounts(fname, col.names=TRUE)

set.seed(100)
e.out <- emptyDrops(counts(sce),lower=100)
sum(e.out$FDR <= 0.001, na.rm=TRUE)
[1] 4237

set.seed(100)
e.out.200 <- emptyDrops(counts(sce),lower=200)
sum(e.out.200$FDR <= 0.001, na.rm=TRUE)
[1] 4346

set.seed(100)
e.out.1000 <- emptyDrops(counts(sce),lower=1000)
sum(e.out.1000$FDR <= 0.001, na.rm=TRUE)
[1] 4326

set.seed(123)
e.out.test <- emptyDrops(counts(sce),lower=100)
sum(e.out.test$FDR <= 0.001, na.rm=TRUE)
[1] 4223

set.seed(123)
e.out.test <- emptyDrops(counts(sce),lower=200)
sum(e.out.test$FDR <= 0.001, na.rm=TRUE)
[1] 4316

set.seed(123)
e.out.test <- emptyDrops(counts(sce),lower=1000)
sum(e.out.test$FDR <= 0.001, na.rm=TRUE)
[1] 4324
DropletUtils scRNA-seq emptyDrops simplesinglecell cell calling • 2.2k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 2 hours ago
The city by the bay

lower is used to estimate the "expression" profile of the ambient pool; changing lower just means that you're using more-or-less low-count libraries to do this estimation. Provided all of the low count libraries are empty droplets, the ambient proportion estimates should not change much from using more libraries. The fact that the number of detected cells is stable is actually encouraging as it means that the analysis is robust to the choice of lower, i.e., exactly how empty droplets are defined doesn't matter much.

A secondary effect is that libraries with counts below lower are not tested by default, as we're already assuming that they're empty droplets (so why bother testing them?). This means that there is a small reduction in the multiple testing burden as you increase lower. One would expect this to result in a slight increase in the number of detected cells, though this really depends on whether you start throwing away more cells with total counts below lower, and in this case, it all comes out in the wash.

P.S. I'll give you the benefit of the doubt and assume you're not the same person who posted this, but if you are, it is usually good etiquette to wait a reasonable timeframe before reposting.

ADD COMMENT
0
Entering edit mode

Hi Aaron,

Thank you very much for your clear explanation. I have one more related question: in your f1000research paper, it didn't have cell calling step. It makes me wonder whether there are specific cases where you don't need to do cell calling. Thank you!

ADD REPLY
1
Entering edit mode

Well, y'know, back in my day, we didn't have fancy droplet methods for scRNA-seq. We had microwell plates. Plates, and a pipette, for the entire lab. And we had to share the pipette!

More seriously, all of the data in the paper was taken from plate-based methods (or close to it, e.g., C1), where we were pretty sure of getting a cell from each library. So there wasn't any need for an explicit cell calling step, though obviously quality control was still required to protect against damaged cells or cell fragments. You might get an occasional accidentally empty well, but that was very much a minority and would be caught by QC using outliers. By comparison, in droplet-based methods, the empty droplets are the majority so an outlier-based QC method would fail.

ADD REPLY
0
Entering edit mode

Based on this discussion is the following statement correct?

Lower lower values would be more "reliable" on what one confidently thinks are empty droplets, and more "flexible" on what one thinks might be real cells with low library sizes.

ADD REPLY
1
Entering edit mode

"Conservative" would be a better word. Yes, a lower lower would be less likely to (i) remove genuine cells and (ii) accidentally include them in the estimate of the ambient profile. But I'm not sure I would set this value too low because you still want to be able to get enough counts to be able to stably estimate the ambient profile. There are also some other nagging suspicions:

  • Cell barcode misassignment artifacts. A very low lower favors barcodes that have, e.g., only 1 or 2 counts. I have often wondered whether these transcripts really originate from droplets corresponding to those barcodes or if they are just the result of sequencing errors in the barcode itself, possibly resulting in counts "bleeding" over from cell-containing libraries. This effect, if present, would distort all ambient estimates, but a high lower does protect a little against that.
  • Cell fragments in otherwise empty droplets. A low lower yields an estimate of the ambient profile that excludes droplets with small cell fragments. This effect is not exactly wrong - the droplet isn't really empty, after all - but it does mean that the resulting tests in emptyDrops() will reject the null a lot more frequently that it should. I think I observe this in some datasets even with the default lower, but it might get more pronounced if one was to lower it.

You could try to lower lower but the default seems to work pretty well. Check the $p$-value histogram with test.ambient=TRUE as discussed here.

ADD REPLY

Login before adding your answer.

Traffic: 411 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6