Hi !
I've tried to compare the output from Cellranger 3.0 with DropletUtils::emptyDrops() in term of number of cell-containing droplets. The web-summary from the Cellranger run outputs 1240 cells which is indeed very consistant when looking at the UMI Vs ranked-barcodes plot (the drop of the curve is around the 1100-1300th barcode). It is far less than I expected with regard to the number of cells I've put as input but this could be explain by some technical reasons I guess (wrong cell counting, bad cell lysis).
However when running DropletUtils::emptyDrops() on the unfiltered Cellranger matrix.h5, I obtain very high number of cell-containing barcodes (15307 !!). This is with default parameters but I've tried to adjust and it doesn't change anything (see following code). How comes that DropletUtils::emptyDrops() outputs so different numbers than cell ranger 3.0? What should I do?
Thanks a lot !
sce <- read10xCounts("/Users/Gregoire/Desktop/Dowloaded_from_cluster/count/raw_feature_bc_matrix.h5", col.names=T)
sce
#CASE 1 use emptyDrops with default parameters : it will retain every cells above knee of UMI Vs ranked-barcode plot
counts <- counts(sce)
tic()
set.seed(100)
e.out.1 <- DropletUtils::emptyDrops(counts,BPPARAM = SerialParam())
summary(e.out.1$FDR < 0.001)
sum(e.out.1$FDR <= 0.001, na.rm=TRUE)
toc()
# RESULT = 15307
#CASE 2 use emptyDrops with retain=inf: it will retain every cells above inflection point of UMI Vs ranked-barcode plot
tic()
set.seed(100)
e.out.2 <- DropletUtils::emptyDrops(counts, retain = Inf)
summary(e.out.2$FDR < 0.001)
sum(e.out.2$FDR <= 0.001, na.rm=TRUE)
toc()
# RESULT = 15307
#CASE 3 use emptyDrops on counts matrix without ribosomal or mitochondrial genes that could be captures in non-cell-containing drops
qc <- list(Mito = grep("^MT-",rowData(sce)$Symbol),
Ribo = grep("^RPS",rowData(sce)$Symbol)
)
keep <- !row.names(counts) %in% unlist(qc)
tic()
set.seed(100)
e.out.3 <- DropletUtils::emptyDrops(counts[keep, ])
summary(e.out.3$FDR < 0.001)
sum(e.out.3$FDR <= 0.001, na.rm=TRUE)
toc()
# RESULT = 15307
#CASE 4 use emptyDrops with lower =200 : in case the knee is artificially put in small bump of the curve at high ranked-barcode position
tic()
set.seed(100)
e.out.4 <- DropletUtils::emptyDrops(counts,lower =200)
summary(e.out.4$FDR < 0.001)
sum(e.out.4$FDR <= 0.001, na.rm=TRUE)
toc()
# RESULT = 5960
I think
emptyDrops
function will always retain more cells as compared to the cell ranger pipeline as shown here. You can check usingSeurat
QC functions if the additional cells that are being rescued are actual cells or doublets. If you want to get similar number of cells you can usedefaultDrops
function ofDropletUtils
package as mentioned here. But usingdefaultDrops
will kill the entire purpose of using raw matrices and performing custom filtering.Aron is a wonderful developer and he always write functions which require less user intervention in function arguments. So stick to default if you are not sure what you are doing. Besides, the aforementioned link states:
I use the following function for most of my datasets:
If you have data where the low number of genes are expressed (like single cells at ring stage of Plasmodium) altering the
lower
parameter would be necessary. However, usually the default threshold works pretty well!! I see that you are filtering a lot of cells using thelower
argument, and I would like to point you toward this extensive discussion on the effect of usage of this parameter here