emptyDrops() calling 'too many' non-empty droplets
2
0
Entering edit mode
Peter Hickey ▴ 560
@petehaitch
Last seen 4 days ago
Walter and Eliza Hall Institute of Medi…

(Apologies, this got rather long ...)

I've encountered a situation where DropletUtils::emptyDrops() was calling 'too many' non-empty droplets and would appreciate some advice. Initially, I noticed this affecting 1/60 captures in a large scRNA-seq experiment I'm analysing. But upon investigating the issue further I'm wondering whether I should also revise how non-empty droplets are being identified in the other 59/60 captures (and more generally in other similar-ish experiments).

I found similarities to a previously reported issue ('emptyDrops behaviour when excluding mitochondrial/ribosomal genes') whereby the performance of emptyDrops(), and in particular the identification of the knee point in the barcode rank plot, was improved by excluding mitochondrial and ribosomal protein genes (this is now documented in the 'Non-empty droplets versus cells' section of the man page for emptyDrops()). In my case, the knee point is being properly identified, but there are still too many non-empty droplets. Another point of difference to the earlier report is that the genes driving the differences between the retained low-count barcodes and discarded barcodes in this capture are T-cell receptor (TCR) genes, which I suspect may suggest the need for assay-specific or sample-specific gene filtering.

Some details of my experiment:

• Human PBMCs profiled with 10x Genomics 5' chemistry (v2), including gene expression (GEX), a large panel (BioLegend's TotalSeq™-C Human Universal Cocktail) of antibody derived tags (ADTs), and VDJ sequencing of T-cell receptors (TCRs) and B-cell receptors (BCRs).
• 12 batches (named 3-14 for reasons), with each batch containing 5 captures, and targeting ~20,000 cells/capture.
• Within a batch, each capture is roughly a technical replicate because the cells come from the same population (a tube containing a mixture of PBMCs from different donors labelled with HTOs) which are then distributed across the 5 captures.

Initial problematic capture

The problematic capture is capture 1 of batch 13 (C120_batch13_1). The number of non-empty droplets in this capture is 3x larger than is reasonable/expected (~65,000 rather than ~20,000 non-empty droplets from 1 lane (capture) of 10x 5' scRNA-seq data). The other captures within this batch have similar and expected numbers of non-empty droplets. After excluding mitochondrial, ribosomal protein, TCR, and BCR genes (hereafter referred to as gene-filtering), emptyDrops() returns a number of non-empty droplets for C120_batch13_1 that is more like what I expect (~20,000). Interestingly, Cell Ranger (v7) reports ~20,000 cells for this capture, so presumably they do some gene-filtering before applying their implementation of the Empty Drops algorithm.

Capture emptyDrops() without gene-filtering emptyDrops() with gene-filtering Cell Ranger (v7)
C120_batch13_1 65,561 20,288 20,209
C120_batch13_2 22,627 22,694 22,602
C120_batch13_3 22,020 22,169 22,102
C120_batch13_4 22,297 22,504 22,425
C120_batch13_5 24,739 24,786 24,656

Following the advice in OSCA Advanced 'Testing for empty droplets' and 'emptyDrops behaviour when excluding mitochondrial/ribosomal genes', I made:

1. Barcode rank plots.
2. A histogram of P-values for low-total barcodes.
3. MA plot comparing the retained low-count barcodes to the discarded barcodes.

Barcode rank plots

In these plots, the top row is without gene filtering and the bottom row is after filtering out mitochondrial, ribosomal protein, TCR, and BCR genes. Nothing seems amiss in the barcode plots, with the knee being well-identified in all 5 captures. The same is true without and with gene-filtering.

Histogram of P-values

In these plots, the top row is without gene-filtering and the bottom row is with gene-filtering. You can clearly see the P-value distribution is not uniform for C120_batch13_1 without gene-filtering but looks reasonable-ish for the other 4 captures. I was a little surprised that the P-value histograms are, if anything, 'worse' after gene filtering. However, given the non-empty droplet calls look more 'reasonable' after gene filtering, I'm leaning towards overlooking this anomaly(?), although it would be good to understand why this is and if I should be more concerned by it.

MA plot

In these plots,

• e0 = emptyDrops(m, test.ambient = TRUE)
• e1 = emptyDrops(m[keep, ], test.embient = TRUE) where keep is a logical vector that is FALSE for mitochondrial, ribosomal protein, TCR, and BCR genes and TRUE otherwise.

You can see in C120_batch13_1 that there's a lot of genes with large logFCs but that the TCR genes stand out. I'm a bit miffed why this has only affected 1/5 captures in this batch and I assume that this capture is likely to be more affected by ambient RNA which I'll need to keep any eye on in downstream analysis.

In the subtitle of each plot I've reported the number of droplets exclusively identified in e0 (|e0_only|) or exclusively identified in e1 (|e1_only|) as non-empty.

Extension to other captures in this experiment

I then decided to re-run emptyDrops() on each of the 60 captures both without and with gene filtering to see if any of the other batches might be affected (I had not previously looked very closely at the emptyDrops() diagnostics because I'd obtained seemingly sensible estimates of the number of non-empty droplets).

I've uploaded PDFs of the barcode rank plots, P-value histograms, and MA plots.

A few things stand out:

• Barcode plots
• The barcode rank plots all look okay, with the knee point well-identified even when there are small 'kinks' in the plot (e.g., batch5, batch10).
• The plots look much the same whether or not gene filtering was applied.
• P-value histograms
• The distributions are generally quite similar within a batch.
• A non-uniform distribution, with a peak near zero, occurs more frequently than I'd expected to see (batch3, batch4, batch5, batch8, batch9, batch10, batch13 batch14)
• Occasionally there are distributions with a peak near one for an almost U-shaped distribution (batch6)
• Filtering genes does not substantially change the P-value histograms. If anything, it sometimes makes it 'worse'.
• Occasionally there is within-batch variation (exemplified by batch10_1 being less uniform than other captures in that batch but also e.g., batch7_5 is perhaps less uniform than other batch7 captures)
• MA plots
• For most captures the number of droplets exclusively identified without or with gene filtering is relatively small (<5%, being a few hundred out of ~20,000 droplets). Consequently, the pseudobulk profiles of the e0_only droplets, shown in the MA plots, aren't based on very large library sizes and the MA plots are quite 'sparse'.
• The MA plots are generally quite similar within a batch.
• The set of genes that are more highly expressed in the e0_only droplets than in the ambient droplets, if they exist, are almost always readily identifiable as ribo, tcr, or bcr enriched:
• ribo: batch3, batch9,
• tcr: batch13 (specifically batch13_1)
• bcr: batch3, batch5, batch6, batch10
• batch13_1 stands out as having a lot of other (i.e. non-[mito|ribo|tcr|bcr]) genes that are highly expressed in the e0_only droplets over the ambient droplets.
• The mito genes aren't consistent across batches, but are perhaps more often highly expressed in the ambient droplets than in the e0_only droplets.

Conclusions and remaining questions

I'm satisfied I've found the root cause of the 'too many cells' issue for C120_batch13_1 and have a remedy by filtering out mitochondrial, ribosomal protein, TCR, and BCR genes prior to running emptyDrops(). However, I'd appreciate any thoughts on these questions that are still dogging me:

• Should I apply this gene-filtering prior to running emptyDrops() for (1) all captures in this batch or (2) even all batches in this experiment or (3) even more broadly (e.g., all 5' datasets or all datasets that include large numbers of T cells or B cells)?
• Could filtering out BCR (resp. TCR) genes from emptyDrops() lead to B cells (resp. T cells) being marked as empty droplets? E.g., from my woeful knowledge of immunology I _think_ there are certain types of B cells where most of the gene expression is from the immunoglobulins (BCR genes).
• How concerned should I be by the non-uniform P-value distributions, particularly because it's seemingly sometimes exacerbated by the gene-filtering?
DropletUtils emptyDrops scRNA scRNAseq 10x • 151 views
0
Entering edit mode
@68acdb1e
Last seen 4 days ago
Switzerland

I've encountered similar issues (i.e. inconsistency between CellRanger's implementation of emptyDrops() and DropletUtils::emptyDrops defaults) due to ambient profile differences. Also, I presume your batches are multiplexed (as you are using v2 chemistry?). I ask because I've often wondered if overloading the lanes can affect the ambient profile for samples (Although I am don't remember all the 10X Chromium specifications offhand)

A few disparate thoughts:

• For your problematic C120_batch13_1 sample, it seems the TCRs are missing from the ambient pool (hence you are capturing an extra 40k 'false-positive' cells being driven by TCR genes).
• Maybe examine the library sizes of the "real barcodes" and "false positive barcodes" ? (we already know that the empty barcodes have library size < 100 (emptyDrops))
• I have toyed with the idea of having a batch-wide/"global" ambient pool; this would alleviate/eliminate the issue seen in C120_batch13_1.
• I don't think you risk losing real B/T cells by just filtering on the BCR/TCR genes (prior to calling emptyDrops()). It seems (biologically) unlikely that a barcode is highly enriched for BCR/TCR genes while other B/T cell markers have (relatively) low UMI counts. Simply speaking, the BCR/TCR gene expression/UMIs alone shouldn't dictate what is a real cell or not.
0
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 3 hours ago
The city by the bay

Perhaps there is something wonky with the TCR sequences that exclude them from the barcodes with below 100 counts. Maybe TCR sequences are prone to aggregation and so, when you observe them in an empty droplet, they always come with high counts. You could check this with DropletUtils::emptyDropsCellRanger, which - among other things - uses CellRanger's definition of an ambient profile. IIRC the main difference is that it uses barcodes in a more restricted range of total counts - something like 45000-th to 90000-th barcodes with the largest totals - compared to my original emptyDrops implementation, which takes all barcodes with total counts below 100.

As for the p-value distributions; nothing obvious comes to mind. Maybe the ribosomal genes were having a "buffering" effect on minor violations of the null in other genes, assuming that the null was true for the ribosomal genes in most samples; their relatively large counts would mean that they dominate the p-value considerations when the total count is low and the evidence for rejection from other genes is near-zero. Having said that, the distributions after filtering are not a disaster zone of model misspecification (e.g., with spikes at zero and 1) so I'd just move on.