Hello,
I hope someone can help me understand this better. I am running the emptyDrops() function from the DropletUtils package, which distinguishes empty from non-empty droplets. The function accepts a lower parameter, which specifies the lower bound on the total UMI count, at or below which all barcodes are assumed to correspond to empty droplets. Link to manual.
Based on this, I thought that specifying a small number for lower would detect more non-empty droplets, and a bigger number for lower would detect fewer. However, that is not what happens: it consistently detects more non-empty droplets when lower is set to a larger number. Am I missing something? Here is the code to reproduce the results.
library(DropletUtils)
library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",
"cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))
untar(raw.path, exdir=file.path(tempdir(), "pbmc4k"))
library(DropletUtils)
fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")
sce <- read10xCounts(fname, col.names=TRUE)
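# Run emptyDrops() at several 'lower' thresholds. The seed is reset before each
# call because the Monte Carlo p-values are computed by random sampling.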
set.seed(100)
e.out <- emptyDrops(counts(sce),lower=100)
sum(e.out$FDR <= 0.001, na.rm=TRUE)
[1] 4237
set.seed(100)
e.out.200 <- emptyDrops(counts(sce),lower=200)
sum(e.out.200$FDR <= 0.001, na.rm=TRUE)
[1] 4346
set.seed(100)
e.out.1000 <- emptyDrops(counts(sce),lower=1000)
sum(e.out.1000$FDR <= 0.001, na.rm=TRUE)
[1] 4326
set.seed(123)
e.out.test <- emptyDrops(counts(sce),lower=100)
sum(e.out.test$FDR <= 0.001, na.rm=TRUE)
[1] 4223
set.seed(123)
e.out.test <- emptyDrops(counts(sce),lower=200)
sum(e.out.test$FDR <= 0.001, na.rm=TRUE)
[1] 4316
set.seed(123)
e.out.test <- emptyDrops(counts(sce),lower=1000)
sum(e.out.test$FDR <= 0.001, na.rm=TRUE)
[1] 4324
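To make the comparison a bit more concrete, here is a small sketch (using the objects already created above, with the same 0.001 FDR threshold) that cross-tabulates which barcodes are called at two of the settings:
# Cross-tabulate the calls at lower=100 versus lower=1000, using the objects above.
# NA entries are barcodes that emptyDrops() treated as ambient and did not test.
is.cell.100 <- e.out$FDR <= 0.001
is.cell.1000 <- e.out.1000$FDR <= 0.001
table(lower.100=is.cell.100, lower.1000=is.cell.1000, useNA="ifany")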
Hi Aaron,
Thank you very much for your clear explanation. I have one more related question: your F1000Research paper doesn't include a cell calling step, which makes me wonder whether there are specific cases where cell calling is not needed. Thank you!
Well, y'know, back in my day, we didn't have fancy droplet methods for scRNA-seq. We had microwell plates. Plates, and a pipette, for the entire lab. And we had to share the pipette!
More seriously, all of the data in the paper was taken from plate-based methods (or close to it, e.g., C1), where we were pretty sure of getting a cell from each library. So there wasn't any need for an explicit cell calling step, though obviously quality control was still required to protect against damaged cells or cell fragments. You might get an occasional accidentally empty well, but that was very much a minority and would be caught by QC using outliers. By comparison, in droplet-based methods, the empty droplets are the majority so an outlier-based QC method would fail.
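To illustrate that last point with a toy example (my own made-up numbers, not from the paper): when empty droplets dominate, the median and MAD of the library sizes describe the empties themselves, so an outlier rule like "more than 3 MADs below the median" removes almost nothing.
# Toy simulation: 1,000 "real cells" with large library sizes plus 50,000
# empty droplets with small library sizes.
set.seed(1)
lib.sizes <- c(rpois(1000, lambda=5000), rpois(50000, lambda=50))
log.sizes <- log10(lib.sizes + 1)

# Outlier-based filter: flag barcodes more than 3 MADs below the median.
threshold <- median(log.sizes) - 3 * mad(log.sizes)
sum(log.sizes < threshold)
# Only a tiny fraction of the 50,000 empty droplets gets flagged, because the
# "typical" barcode is itself empty.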
Based on this discussion, is the following statement correct? Lower values of lower would be more "reliable" about what one confidently thinks are empty droplets, and more "flexible" about what one thinks might be real cells with low library sizes.

"Conservative" would be a better word. Yes, a lower value of lower would be less likely to (i) remove genuine cells and (ii) accidentally include them in the estimate of the ambient profile. But I'm not sure I would set this value too low, because you still want enough counts to stably estimate the ambient profile. There are also some other nagging suspicions:

- A low lower favors barcodes that have, e.g., only 1 or 2 counts. I have often wondered whether these transcripts really originate from droplets corresponding to those barcodes, or if they are just the result of sequencing errors in the barcode itself, possibly resulting in counts "bleeding" over from cell-containing libraries. This effect, if present, would distort all ambient estimates, but a high lower does protect a little against it.
- A low lower yields an estimate of the ambient profile that excludes droplets with small cell fragments. This effect is not exactly wrong - the droplet isn't really empty, after all - but it does mean that the resulting tests in emptyDrops() will reject the null a lot more frequently than they should. I think I observe this in some datasets even with the default lower, but it might get more pronounced if one was to lower it.

You could try to lower lower, but the default seems to work pretty well. Check the p-value histogram with test.ambient=TRUE, as discussed here.
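For anyone following along, that diagnostic would look something like this on the object from the original post (a sketch following the documented test.ambient= usage; with test.ambient=TRUE, p-values are also computed for the presumed-ambient barcodes, and these should be roughly uniform if the ambient model holds):
set.seed(100)
limit <- 100
all.out <- emptyDrops(counts(sce), lower=limit, test.ambient=TRUE)

# Histogram of p-values for the presumed-empty barcodes (total counts at or
# below 'lower'); strong deviations from uniformity suggest problems with the
# ambient model.
hist(all.out$PValue[all.out$Total <= limit & all.out$Total > 0],
     xlab="P-value", main="", col="grey80")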