Question

`emptyDrops()` calling 'too many' non-empty droplets

2

Entering edit mode

Peter Hickey ▴ 750

@petehaitch

Last seen 7 days ago

WEHI, Melbourne, Australia

(Apologies, this got rather long ...)

I've encountered a situation where DropletUtils::emptyDrops() was calling 'too many' non-empty droplets and would appreciate some advice. Initially, I noticed this affecting 1/60 captures in a large scRNA-seq experiment I'm analysing. But upon investigating the issue further I'm wondering whether I should also revise how non-empty droplets are being identified in the other 59/60 captures (and more generally in other similar-ish experiments).

I found similarities to a previously reported issue ('emptyDrops behaviour when excluding mitochondrial/ribosomal genes') whereby the performance of emptyDrops(), and in particular the identification of the knee point in the barcode rank plot, was improved by excluding mitochondrial and ribosomal protein genes (this is now documented in the 'Non-empty droplets versus cells' section of the man page for emptyDrops()). In my case, the knee point is being properly identified, but there are still too many non-empty droplets. Another point of difference to the earlier report is that the genes driving the differences between the retained low-count barcodes and discarded barcodes in this capture are T-cell receptor (TCR) genes, which I suspect may suggest the need for assay-specific or sample-specific gene filtering.

Some details of my experiment:

Human PBMCs profiled with 10x Genomics 5' chemistry (v2), including gene expression (GEX), a large panel (BioLegend's TotalSeq™-C Human Universal Cocktail) of antibody derived tags (ADTs), and VDJ sequencing of T-cell receptors (TCRs) and B-cell receptors (BCRs).
12 batches (named 3-14 for reasons), with each batch containing 5 captures, and targeting ~20,000 cells/capture.
Within a batch, each capture is roughly a technical replicate because the cells come from the same population (a tube containing a mixture of PBMCs from different donors labelled with HTOs) which are then distributed across the 5 captures.

Initial problematic capture

The problematic capture is capture 1 of batch 13 (C120_batch13_1). The number of non-empty droplets in this capture is 3x larger than is reasonable/expected (~65,000 rather than ~20,000 non-empty droplets from 1 lane (capture) of 10x 5' scRNA-seq data). The other captures within this batch have similar and expected numbers of non-empty droplets. After excluding mitochondrial, ribosomal protein, TCR, and BCR genes (hereafter referred to as gene-filtering), emptyDrops() returns a number of non-empty droplets for C120_batch13_1 that is more like what I expect (~20,000). Interestingly, Cell Ranger (v7) reports ~20,000 cells for this capture, so presumably they do some gene-filtering before applying their implementation of the Empty Drops algorithm.

Capture	`emptyDrops()` without gene-filtering	`emptyDrops()` with gene-filtering	Cell Ranger (v7)
C120_batch13_1	65,561	20,288	20,209
C120_batch13_2	22,627	22,694	22,602
C120_batch13_3	22,020	22,169	22,102
C120_batch13_4	22,297	22,504	22,425
C120_batch13_5	24,739	24,786	24,656

Following the advice in OSCA Advanced 'Testing for empty droplets' and 'emptyDrops behaviour when excluding mitochondrial/ribosomal genes', I made:

Barcode rank plots.
A histogram of P-values for low-total barcodes.
MA plot comparing the retained low-count barcodes to the discarded barcodes.

Barcode rank plots

In these plots, the top row is without gene filtering and the bottom row is after filtering out mitochondrial, ribosomal protein, TCR, and BCR genes. Nothing seems amiss in the barcode plots, with the knee being well-identified in all 5 captures. The same is true without and with gene-filtering.

Barcode rank plots for `batch13`

Histogram of P-values

In these plots, the top row is without gene-filtering and the bottom row is with gene-filtering. You can clearly see the P-value distribution is not uniform for C120_batch13_1 without gene-filtering but looks reasonable-ish for the other 4 captures. I was a little surprised that the P-value histograms are, if anything, 'worse' after gene filtering. However, given the non-empty droplet calls look more 'reasonable' after gene filtering, I'm leaning towards overlooking this anomaly(?), although it would be good to understand why this is and if I should be more concerned by it.

P-value histograms for `batch13`

MA plot

In these plots,

e0 = emptyDrops(m, test.ambient = TRUE)
e1 = emptyDrops(m[keep, ], test.embient = TRUE) where keep is a logical vector that is FALSE for mitochondrial, ribosomal protein, TCR, and BCR genes and TRUE otherwise.

You can see in C120_batch13_1 that there's a lot of genes with large logFCs but that the TCR genes stand out. I'm a bit miffed why this has only affected 1/5 captures in this batch and I assume that this capture is likely to be more affected by ambient RNA which I'll need to keep any eye on in downstream analysis.

In the subtitle of each plot I've reported the number of droplets exclusively identified in e0 (|e0_only|) or exclusively identified in e1 (|e1_only|) as non-empty.

MA plots for `batch13`

Extension to other captures in this experiment

I then decided to re-run emptyDrops() on each of the 60 captures both without and with gene filtering to see if any of the other batches might be affected (I had not previously looked very closely at the emptyDrops() diagnostics because I'd obtained seemingly sensible estimates of the number of non-empty droplets).

I've uploaded PDFs of the barcode rank plots, P-value histograms, and MA plots.

A few things stand out:

Barcode plots
- The barcode rank plots all look okay, with the knee point well-identified even when there are small 'kinks' in the plot (e.g., batch5, batch10).
- The plots look much the same whether or not gene filtering was applied.
P-value histograms
- The distributions are generally quite similar within a batch.
- A non-uniform distribution, with a peak near zero, occurs more frequently than I'd expected to see (batch3, batch4, batch5, batch8, batch9, batch10, batch13 batch14)
- Occasionally there are distributions with a peak near one for an almost U-shaped distribution (batch6)
- Filtering genes does not substantially change the P-value histograms. If anything, it sometimes makes it 'worse'.
- Occasionally there is within-batch variation (exemplified by batch10_1 being less uniform than other captures in that batch but also e.g., batch7_5 is perhaps less uniform than other batch7 captures)
MA plots
- For most captures the number of droplets exclusively identified without or with gene filtering is relatively small (<5%, being a few hundred out of ~20,000 droplets). Consequently, the pseudobulk profiles of the e0_only droplets, shown in the MA plots, aren't based on very large library sizes and the MA plots are quite 'sparse'.
- The MA plots are generally quite similar within a batch.
- The set of genes that are more highly expressed in the e0_only droplets than in the ambient droplets, if they exist, are almost always readily identifiable as ribo, tcr, or bcr enriched:
  - ribo: batch3, batch9,
  - tcr: batch13 (specifically batch13_1)
  - bcr: batch3, batch5, batch6, batch10
- batch13_1 stands out as having a lot of other (i.e. non-[mito|ribo|tcr|bcr]) genes that are highly expressed in the e0_only droplets over the ambient droplets.
- The mito genes aren't consistent across batches, but are perhaps more often highly expressed in the ambient droplets than in the e0_only droplets.

Conclusions and remaining questions

I'm satisfied I've found the root cause of the 'too many cells' issue for C120_batch13_1 and have a remedy by filtering out mitochondrial, ribosomal protein, TCR, and BCR genes prior to running emptyDrops(). However, I'd appreciate any thoughts on these questions that are still dogging me:

Should I apply this gene-filtering prior to running emptyDrops() for (1) all captures in this batch or (2) even all batches in this experiment or (3) even more broadly (e.g., all 5' datasets or all datasets that include large numbers of T cells or B cells)?
Could filtering out BCR (resp. TCR) genes from emptyDrops() lead to B cells (resp. T cells) being marked as empty droplets? E.g., from my woeful knowledge of immunology I _think_ there are certain types of B cells where most of the gene expression is from the immunoglobulins (BCR genes).
How concerned should I be by the non-uniform P-value distributions, particularly because it's seemingly sometimes exacerbated by the gene-filtering?

DropletUtils emptyDrops scRNA scRNAseq 10x • 3.4k views

ADD COMMENT • link 2.5 years ago Peter Hickey ▴ 750

0

Entering edit mode

I've had some further and ongoing discussion with 10x tech support. I've copied the initial reply below and will add any useful updates.

In short, for GEX+VDJ libraries Cell Ranger does additional filtering of barcodes based on the VDJ data (i.e. it's more than just their implementation of the EmptyDrops algorithm applied to the GEX data). I'm still not sure I've quite got my head around it, however, because the information in https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/using/multi#why seems mostly to be about refining non-empty droplet calls for T and B cells. How, then, does Cell Ranger (specifically cellranger multi) estimate/identify non-empty droplets that don't contain T or B cells?

Anyway, as I pointed out in my reply, it would really help if the source code of Cell Ranger was easily browsable and publicly available, because AFAICT the source code is only available up to v3.0.2 (https://github.com/10XGenomics/cellranger).

Knowing what I do now about how Cell Ranger is applying additional and assay-specific steps for non-empty droplet detection in GEX+VDJ+ADT samples, I think I will opt to use Cell Ranger's non-empty droplet detection for these samples.

10x tech support's reply

From 10x Genomics' perspective DropletUtils is a 3rd party tool, therefore, your support for this tool is very limited. My understanding is that when you are running the emptyDrops() function, you are finding a larger than usual number of non-empty cells due to the presence of ambient RNA. The Cell Ranger pipeline for V(D)J libraries have filters in place that detect droplets with potentially ambient RNA, and these contigs and associated barcodes are filtered out. This is done is multiple steps throughout the heuristics of the algorithm (for example, filtering out contigs with low confidence, filtering out contigs that are not "productive" etc). Details about how the algorithm works on the V(D)J library can be found here: https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/algorithms/overview.

An important thing to notice is that the cellranger vdj pipeline (integrated into the cellranger multi pipeline, if you have used that), does not have an implementation of EmptyDrops like cellranger count (for GEX library) does (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/algorithms/overview#cell_calling). Therefore, based on the information I have now, I would like to suggest that if you have V(D)J library processed with Cell Ranger, then you can most likely trust that data for downstream analysis. Unless you see gross over-calling of cells in the V(D)J data or multiple clonotypes that are only single chains or oddly chimeric, I would not start to doubt the Cell Ranger results.

Based on what I have written above, you may wonder that in case of cellranger multi pipeline, how would the pipeline escape the EmptyDrops stage. Details about the multi pipeline are available here: https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/using/multi#why

Basically, the GEX data cell calling happens with EmptyDrops in the early stages, and then V(D)J data cell calling happens according to the cellranger vdj algorithm. Cells that are called only in the V(D)J library but not in the GEX library, are filtered out from the final results. This is done to ensure more accurate cell calling on the V(D)J part, that is, even if the cellranger vdj algorithm decides that a barcode is cellular and not background, if it is background from the GEX perspective, then it is filtered out, therefore, further reducing the chance of assigning a barcode as a cell when it is merely associated with ambient RNA.

ADD REPLY • link 2.5 years ago Peter Hickey ▴ 750

score 1 · Answer 1 · 2022-08-05

I've encountered similar issues (i.e. inconsistency between CellRanger's implementation of emptyDrops() and DropletUtils::emptyDrops defaults) due to ambient profile differences. Also, I presume your batches are multiplexed (as you are using v2 chemistry?). I ask because I've often wondered if overloading the lanes can affect the ambient profile for samples (Although I am don't remember all the 10X Chromium specifications offhand)

A few disparate thoughts:

For your problematic C120_batch13_1 sample, it seems the TCRs are missing from the ambient pool (hence you are capturing an extra 40k 'false-positive' cells being driven by TCR genes).
Maybe examine the library sizes of the "real barcodes" and "false positive barcodes" ? (we already know that the empty barcodes have library size < 100 (emptyDrops))
I have toyed with the idea of having a batch-wide/"global" ambient pool; this would alleviate/eliminate the issue seen in C120_batch13_1.
I don't think you risk losing real B/T cells by just filtering on the BCR/TCR genes (prior to calling emptyDrops()). It seems (biologically) unlikely that a barcode is highly enriched for BCR/TCR genes while other B/T cell markers have (relatively) low UMI counts. Simply speaking, the BCR/TCR gene expression/UMIs alone shouldn't dictate what is a real cell or not.

score 1 · Answer 2 · 2022-08-07

Perhaps there is something wonky with the TCR sequences that exclude them from the barcodes with below 100 counts. Maybe TCR sequences are prone to aggregation and so, when you observe them in an empty droplet, they always come with high counts. You could check this with DropletUtils::emptyDropsCellRanger, which - among other things - uses CellRanger's definition of an ambient profile. IIRC the main difference is that it uses barcodes in a more restricted range of total counts - something like 45000-th to 90000-th barcodes with the largest totals - compared to my original emptyDrops implementation, which takes all barcodes with total counts below 100.

As for the p-value distributions; nothing obvious comes to mind. Maybe the ribosomal genes were having a "buffering" effect on minor violations of the null in other genes, assuming that the null was true for the ribosomal genes in most samples; their relatively large counts would mean that they dominate the p-value considerations when the total count is low and the evidence for rejection from other genes is near-zero. Having said that, the distributions after filtering are not a disaster zone of model misspecification (e.g., with spikes at zero and 1) so I'd just move on.