Question

emptyDrops new default alpha = Inf giving far fewer non-empty droplets than under old default of alpha = NULL

0

Entering edit mode

Peter Hickey ▴ 760

@petehaitch

Last seen 21 days ago

WEHI, Melbourne, Australia

I have a 10x 5' scRNA-seq dataset where Cell Ranger (v9.0.0) reports 64,610 non-empty droplets. This is plausible since it comes from an hashtagged and superloaded GEM-X capture.

I always run DropletUtils::emptyDrops() by way of comparison, and here got very different results: 27,739 non-empty droplets. Knowing that the default value of the alpha = Inf parameter changed recently-ish, I repeated this with the former default alpha = NULL and recovered numbers comparable to Cell Ranger: 72,162 non-empty droplets. I'm trying to understand which approach to put more weight on, and conducted some experiments (copied below). I can share the count matrix, (x; ~300 MB) if it would be helpful. Any suggestions and explanations appreciated.

Thanks, Pete

library(DropletUtils)

x <- readRDS("x.rds")

bcrank <- barcodeRanks(x)
uniq <- !duplicated(bcrank$rank)
plot(
  bcrank$rank[uniq],
  bcrank$total[uniq],
  log = "xy",
  xlab = "Rank",
  ylab = "Total UMI count",
  cex.lab = 1.2)
#> Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value <= 0 omitted from
#> logarithmic plot
abline(h = metadata(bcrank)$inflection, col = "darkgreen", lty = 2)
abline(h = metadata(bcrank)$knee, col = "dodgerblue", lty = 2)
legend(
  "bottomleft",
  legend = c("Inflection", "Knee"),
  col = c("darkgreen", "dodgerblue"),
  lty = 2,
  cex = 1.2)


# emptyDrops under default settings --------------------------------------------

lower <- 100
set.seed(666)
ed_alpha_inf <- emptyDrops(x, lower = lower, test.ambient = TRUE)
ed_alpha_null <- emptyDrops(x, lower = lower, test.ambient = TRUE, alpha = NULL)

sum(ed_alpha_inf$FDR <= 0.001, na.rm = TRUE)
#> [1] 27739
sum(ed_alpha_null$FDR <= 0.001, na.rm = TRUE)
#> [1] 72162

par(mfrow = c(1, 2))
hist(
  ed_alpha_inf$PValue[ed_alpha_inf$Total <= lower & ed_alpha_inf$Total > 0],
  xlab = "P-value",
  main = "alpha = Inf",
  col = "grey80")
hist(
  ed_alpha_null$PValue[ed_alpha_null$Total <= lower & ed_alpha_null$Total > 0],
  xlab = "P-value",
  main = "alpha = NULL",
  col = "grey80")


# emptyDrops under expectation of ~60,000 non-empty droplets -------------------

by.rank <- 70000
set.seed(666)
ed_alpha_inf <- emptyDrops(x, by.rank = by.rank, test.ambient = TRUE)
ed_alpha_null <- emptyDrops(x, by.rank = by.rank, test.ambient = TRUE, alpha = NULL)
sum(ed_alpha_inf$FDR <= 0.001, na.rm = TRUE)
#> [1] 30369
sum(ed_alpha_null$FDR <= 0.001, na.rm = TRUE)
#> [1] 49461

# NOTE: Extract lower (doesn't depend on alpha)
lower <- metadata(ed_alpha_inf)$lower
lower
#> [1] 992
par(mfrow = c(1, 2))
hist(
  ed_alpha_inf$PValue[
    ed_alpha_inf$Total <= lower & ed_alpha_inf$Total > 0],
  xlab = "P-value",
  main = "alpha = Inf",
  col = "grey80")
hist(
  ed_alpha_null$PValue[
    ed_alpha_null$Total <= lower & ed_alpha_null$Total > 0],
  xlab = "P-value",
  main = "alpha = NULL",
  col = "grey80")


# More extreme version of the above --------------------------------------------

by.rank <- 100000
set.seed(666)
ed_alpha_inf <- emptyDrops(x, by.rank = by.rank, test.ambient = TRUE)
ed_alpha_null <- emptyDrops(x, by.rank = by.rank, test.ambient = TRUE, alpha = NULL)
sum(ed_alpha_inf$FDR <= 0.001, na.rm = TRUE)
#> [1] 36717
sum(ed_alpha_null$FDR <= 0.001, na.rm = TRUE)
#> [1] 65752

# NOTE: Extract lower (doesn't depend on alpha)
lower <- metadata(ed_alpha_inf)$lower
lower
#> [1] 220
par(mfrow = c(1, 2))
hist(
  ed_alpha_inf$PValue[
    ed_alpha_inf$Total <= lower & ed_alpha_inf$Total > 0],
  xlab = "P-value",
  main = "alpha = Inf",
  col = "grey80")
hist(
  ed_alpha_null$PValue[
    ed_alpha_null$Total <= lower & ed_alpha_null$Total > 0],
  xlab = "P-value",
  main = "alpha = NULL",
  col = "grey80")


# emptyDropsCellRanger() -------------------------------------------------------

# NOTE: Unsure how closely this actually mimics what Cell Ranger is doing, but
#       including because it's easy enough to do.
ed_cr <- emptyDropsCellRanger(x, n.expected.cells = 60000)
sum(ed_cr$FDR <= 0.001, na.rm = TRUE)
#> [1] 63190

scrn 10x scRNAseq emptyDrops DropletUtils • 994 views

ADD COMMENT • link 5 months ago • updated 4 months ago Peter Hickey ▴ 760

0

Entering edit mode

Session info

> sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Red Hat Enterprise Linux 9.3 (Plow)

Matrix products: default
BLAS:   /stornext/System/data/software/rhel/9/base/tools/R/4.5.1/lib64/R/lib/libRblas.so 
LAPACK: /stornext/System/data/software/rhel/9/base/tools/R/4.5.1/lib64/R/lib/libRlapack.so;  LAPACK version 3.12.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Australia/Melbourne
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods  
[8] base     

other attached packages:
 [1] DropletUtils_1.28.1         SingleCellExperiment_1.30.1
 [3] SummarizedExperiment_1.38.1 Biobase_2.68.0             
 [5] GenomicRanges_1.60.0        GenomeInfoDb_1.44.1        
 [7] IRanges_2.42.0              S4Vectors_0.46.0           
 [9] BiocGenerics_0.54.0         generics_0.1.4             
[11] MatrixGenerics_1.20.0       matrixStats_1.5.0          

loaded via a namespace (and not attached):
 [1] xfun_0.52                 processx_3.8.6            rhdf5_2.52.1             
 [4] lattice_0.22-7            callr_3.7.6               rhdf5filters_1.20.0      
 [7] vctrs_0.6.5               tools_4.5.1               ps_1.9.1                 
[10] parallel_4.5.1            tibble_3.3.0              R.oo_1.27.1              
[13] pkgconfig_2.0.3           Matrix_1.7-3              dqrng_0.4.1              
[16] sparseMatrixStats_1.20.0  lifecycle_1.0.4           GenomeInfoDbData_1.2.14  
[19] compiler_4.5.1            statmod_1.5.0             codetools_0.2-20         
[22] htmltools_0.5.8.1         yaml_2.3.10               pillar_1.11.0            
[25] crayon_1.5.3              R.utils_2.13.0            BiocParallel_1.42.1      
[28] limma_3.64.3              DelayedArray_0.34.1       abind_1.4-8              
[31] locfit_1.5-9.12           digest_0.6.37             fastmap_1.2.0            
[34] grid_4.5.1                cli_3.6.5                 SparseArray_1.8.1        
[37] magrittr_2.0.3            S4Arrays_1.8.1            h5mread_1.0.1            
[40] edgeR_4.6.3               withr_3.0.2               DelayedMatrixStats_1.30.0
[43] UCSC.utils_1.4.0          rmarkdown_2.29            XVector_0.48.0           
[46] httr_1.4.7                R.methodsS3_1.8.2         beachmat_2.24.0          
[49] HDF5Array_1.36.0          evaluate_1.0.4            knitr_1.50               
[52] rlang_1.1.6               Rcpp_1.1.0                scuttle_1.18.0           
[55] glue_1.8.0                BiocManager_1.30.26       renv_1.1.5               
[58] reprex_2.1.1              rstudioapi_0.17.1         jsonlite_2.0.0           
[61] R6_2.6.1                  Rhdf5lib_1.30.0           fs_1.6.6

ADD REPLY • link 5 months ago Peter Hickey ▴ 760

score 0 · Answer 1 · 2025-08-07

0

Entering edit mode

Aaron Lun ★ 29k

@alun

Last seen 10 hours ago

The city by the bay

I suspect the real problem is that the knee point is not being identified correctly in your plot. Visually, I would put the knee point somewhere at the midpoint between the blue and green lines, which should give you something along the lines of "must keep 50000 cells", though it's hard to say with the log scale.

To recap, emptyDrops() will keep all droplets above the knee point, to ensure that high-coverage droplets are retained even if they look similar to the ambient solution. This is designed to maintain detection power even for homogeneous cell populations where the ambient solution might be just a copy of the population (and thus the null hypothesis would never be rejected by the multinomial model).

Unfortunately, knee point detection is somewhat heuristic. There is a proper way to do it based on maximizing the curvature, but it depends on the second derivative and is unstable for empirically-fitted curves. So in https://github.com/MarioniLab/DropletUtils/pull/117, we switched to a much simpler algorithm whereby we draw a line from the 50th cell (see exclude.from= in barcodeRanks()) to the inflection point, and set the knee point to the location on the curve that maximizes the vertical distance from the line.

This approach worked pretty well on most of my datasets (see the results in #117), but in your case, it doesn't quite pick the right location. I think this is because you have so many cells that the 50th cell is quite far away from the inflection point. This gives an opportunity for other local maxima to be considered, and in your case, the chosen location is quite far away from the visual knee point. I'd guess that this is still a knee point of some sort - one can see a bit of lump in the curve - but it's not the one that gives us most detected cells.

The change in alpha= just exposes the real problem. Previously, the anticonservativeness of alpha=NULL (see https://github.com/MarioniLab/DropletUtils/pull/118) was covering up the conservative choice of the knee point. Now that the p-values are properly sized, the consequences of the earlier knee point are more apparent.

The solution might be as simple as picking a better left-side point to define our line. One idea might be to take the x-coordinate of the inflection point, divide it by 10, and use that location on the curve to define our line to the inflection point. This restricts the knee point identification to the interval immediately preceding the inflection point, and ensures that the knee point identification doesn't get distracted by other maxima at lower x-coordinates. Probably something like the code below, see if it works for you:

library(DropletTestFiles)
path <- getTestFile("tenx-3.1.0-5k_pbmc_protein_v3/1.0.0/raw.h5")

library(DropletUtils)
se <- read10xCounts(path, type="HDF5")
mat <- as(assay(se), "SVT_SparseMatrix")

# Paraphrased from barcodeRanks.
totals <- colSums(mat)
o <- order(totals, decreasing=TRUE)
stuff <- rle(totals[o])
run.rank <- cumsum(stuff$lengths) - (stuff$lengths-1)/2
run.totals <- stuff$values
keep <- run.totals > 100 # i.e., lower=100
y <- log10(run.totals[keep])
x <- log10(run.rank[keep])

edge.out <- DropletUtils:::.find_curve_bounds(x=x, y=y, exclude.from=50)
left.edge <- edge.out["left"]
right.edge <- edge.out["right"]
inflection <- 10^(y[right.edge])

new.left.rank <- x[right.edge] - 1 # -1 as we're in log10 units already.
new.left.edge <- findInterval(new.left.rank, x)
new.keep <- new.left.edge:right.edge

curx <- x[new.keep]
cury <- y[new.keep]
xbounds <- curx[c(1L, length(new.keep))]
ybounds <- cury[c(1L, length(new.keep))]
gradient <- diff(ybounds)/diff(xbounds)
intercept <- ybounds[1] - xbounds[1] * gradient
above <- which(cury >= curx * gradient + intercept)
dist <- abs(gradient * curx[above] - cury[above] + intercept)/sqrt(gradient^2 + 1)
knee <- 10^(cury[above[which.max(dist)]])

o <- order(current$rank)
plot(current$rank[o], current$total[o], log="xy", type="l")
abline(h=knee, lty=2)
abline(h=infl, lty=3)

FWIW emptyDropsCellranger() defines its "must keep" threshold differently - namely, using the n.expected.cells and max.percentile arguments - so it doesn't suffer from the knee point problem.

ADD COMMENT • link 5 months ago Aaron Lun ★ 29k

0

Entering edit mode

Ah, I'd forgotten about how the knee comes into the calculation.

Your proposal does do a better job at finding the knee and the subsequent number of non-empty cells is more in line with expectations (albeit still ~10k fewer non-empty droplets than Cell Ranger which I haven't tried yet to account for).

suppressPackageStartupMessages(library(DropletUtils))
mat <- readRDS("x.rds")

# Aaron's proposal for picking a better left-side point to define our line -----

totals <- colSums(mat)
o <- order(totals, decreasing=TRUE)
stuff <- rle(totals[o])
run.rank <- cumsum(stuff$lengths) - (stuff$lengths-1)/2
run.totals <- stuff$values
lower <- 100 # Making next line explicit
keep <- run.totals > lower
y <- log10(run.totals[keep])
x <- log10(run.rank[keep])

edge.out <- DropletUtils:::.find_curve_bounds(x=x, y=y, exclude.from=50)
# NOTE: Names contains droplet barcode; ignore.
left.edge <- edge.out[grep("left", names(edge.out))]
right.edge <- edge.out[grep("right", names(edge.out))]
inflection <- 10^(y[right.edge])

new.left.rank <- x[right.edge] - 1 # -1 as we're in log10 units already.
new.left.edge <- findInterval(new.left.rank, x)
new.keep <- new.left.edge:right.edge

curx <- x[new.keep]
cury <- y[new.keep]
xbounds <- curx[c(1L, length(new.keep))]
ybounds <- cury[c(1L, length(new.keep))]
gradient <- diff(ybounds)/diff(xbounds)
intercept <- ybounds[1] - xbounds[1] * gradient
above <- which(cury >= curx * gradient + intercept)
dist <- abs(gradient * curx[above] - cury[above] + intercept)/sqrt(gradient^2 + 1)
knee <- 10^(cury[above[which.max(dist)]])

current <- barcodeRanks(mat)
uniq <- !duplicated(current$rank)

plot(
  current$rank[uniq],
  current$total[uniq],
  log = "xy",
  xlab = "Rank",
  ylab = "Total UMI count",
  cex.lab = 1.2)
#> Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value <= 0 omitted from
#> logarithmic plot
abline(h = metadata(current)$inflection, col = "darkgreen", lty = 2)
# NOTE: Inflection is defined as before.
abline(h = inflection, col = "darkgreen",  lty = 1)
abline(h = metadata(current)$knee, col = "dodgerblue", lty = 2)
abline(h = knee, col = "dodgerblue", lty = 1)
legend(
  "bottomleft",
  legend = c("Inflection", "New inflection", "Knee", "New knee"),
  col = c("darkgreen", "darkgreen", "dodgerblue", "dodgerblue"),
  lty = c(2, 1, 2, 1),
  cex = 1.2)


metadata(current)$inflection
#> [1] 589
inflection
#> 2_GTCCATGCACTAAAGG-1 
#>                  589
metadata(current)$knee
#> [1] 6328
knee
#> 2_GTTCAGCAGAATGGGT-1 
#>                 2313

# emptyDrops under default settings but provided new knee ----------------------

set.seed(666)
ed_alpha_inf <- emptyDrops(mat, retain = knee,  test.ambient = TRUE)
ed_alpha_null <- emptyDrops(mat, retain = knee, test.ambient = TRUE, alpha = NULL)
metadata(ed_alpha_null)$alpha
#> [1] 908.6222

sum(ed_alpha_inf$FDR <= 0.001, na.rm = TRUE)
#> [1] 53553
sum(ed_alpha_null$FDR <= 0.001, na.rm = TRUE)
#> [1] 73030

par(mfrow = c(1, 2))
hist(
  ed_alpha_inf$PValue[ed_alpha_inf$Total <= lower & ed_alpha_inf$Total > 0],
  xlab = "P-value",
  main = "alpha = Inf",
  col = "grey80")
hist(
  ed_alpha_null$PValue[ed_alpha_null$Total <= lower & ed_alpha_null$Total > 0],
  xlab = "P-value",
  main = "alpha = NULL",
  col = "grey80")

ADD REPLY • link 5 months ago Peter Hickey ▴ 760

0

Entering edit mode

It's all heuristic anyway, so it's not surprising to give or take a few thousand cells. I suspect you'd get that much variation if you fiddled with n.expected.cells or max.percentile.

In any case, you can try out the PR here, which is slightly different from my proposal above but should give a more aggressive knee point definition.

ADD REPLY • link 5 months ago Aaron Lun ★ 29k

0

Entering edit mode

Thanks, Aaron. The PR returns a better looking knee-point than the current default on my data (knee = 2363). knee-point from 124

ADD REPLY • link 4 months ago Peter Hickey ▴ 760

0

Entering edit mode

I assume that plot was for the previous algorithm, not the PR version; the dashed line doesn't look like it's ~2.3k.

Anyway, the PR is merged and should be available in 1.29.5.

ADD REPLY • link 4 months ago Aaron Lun ★ 29k

0

Entering edit mode

Ah you're right. Here it is using v1.29.5

suppressPackageStartupMessages(library(DropletUtils))
packageVersion('DropletUtils')
#> [1] '1.29.5'

mat <- readRDS("~/Downloads/mat.rds")

current <- barcodeRanks(mat)
o <- order(current$rank)
plot(current$rank[o], current$total[o], log="xy", type="l")
#> Warning in xy.coords(x, y, xlabel, ylabel, log): 1090888 y values <= 0 omitted
#> from logarithmic plot
abline(h=metadata(current)$knee, lty=2)
abline(h=metadata(current)$inflection, lty=3)

metadata(current)$knee
#> [1] 2363

ADD REPLY • link 4 months ago Peter Hickey ▴ 760

0

Entering edit mode

Could you share the output of barcodeRanks()? I'd like to continue fiddling with the knee point detection.

ADD REPLY • link 4 months ago Aaron Lun ★ 29k

0

Entering edit mode

Sure: https://www.dropbox.com/scl/fi/8j4ztfh539oczvb1i0p1x/barcodeRanks.rds?rlkey=dwi7hjxtvw0kp7duve16zei9x&st=5pu3tq7o&dl=0

ADD REPLY • link 4 months ago Peter Hickey ▴ 760

0

Entering edit mode

See https://github.com/MarioniLab/DropletUtils/pull/125 for even more aggressive knee point detection, more cells, etc.

ADD REPLY • link 4 months ago Aaron Lun ★ 29k

0

Entering edit mode

Thanks, Aaron. That looks good in by brief testing. I'll apply it to some incoming datasets in the next week or so. These forthcoming datasets are additional batches from the same experiment, so should have similar expected cell numbers and similar-ish ambient RNA levels).

ADD REPLY • link 4 months ago Peter Hickey ▴ 760