Question

Usage of doubletCells() function before or after the batch correction?

1

hamza_karakurt ▴ 60

@hamza_karakurt-17704

Last seen 3.5 years ago

Turkey

Hello, I am doing a scRNA-Seq analysis and I want to use the doubletCells() function to identify possible doublets. My data come from 4 different batches and I use fastMNN for batch correction. Which approach would be better in this situation: running doubletCells() on each data set before batch correction, removing cells with high scores as doublets, and then doing the batch correction; or, after fastMNN(), running doubletCells() on the counts of all data sets together (in which case I think I would have to use the computeSumFactors() function with the counts of all data sets)?

Thank you in advance.

scater scran scRNA-Seq doubletCells • 4.1k views

updated 6.8 years ago by Aaron Lun ★ 29k • written 6.8 years ago by hamza_karakurt ▴ 60

score 5 · Answer 1 · 2019-04-16

5

Aaron Lun ★ 29k

@alun

Last seen 12 hours ago

The city by the bay

There are a number of ways to do this, but in all cases you should compute doublet scores within each batch. It is obviously impossible to get a doublet consisting of cells from different batches! My favored approach is to:

1. Compute doublet scores within each batch, but do not remove any cells yet.
2. Do the batch correction with all cells.
3. Mark clusters as doublets if they contain many cells with high doublet scores.

This is motivated by the fact that not all doublets will be assigned high doublet scores. (This is simply a consequence of the assumptions that are necessary to get doubletCells to work, see comments here.) By leaving in the doublets, we can use "guilt by association" to identify the cluster of doublet cells. If we removed all cells with high doublet scores beforehand, we would not be able to detect these troublesome clusters as all of the remaining doublets would have low scores.
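The "guilt by association" step can be illustrated with a minimal sketch. This is not the doubletCells implementation or any scran API, just plain Python showing the logic; the function name `flag_doublet_clusters`, the `score_cutoff` and the `min_fraction` value are all hypothetical choices made for illustration, and the scores are toy numbers.

```python
from collections import defaultdict

def flag_doublet_clusters(scores, clusters, score_cutoff, min_fraction=0.5):
    """Return the cluster labels in which at least `min_fraction` of cells
    have a doublet score above `score_cutoff`.

    Both `score_cutoff` and `min_fraction` are illustrative assumptions,
    not recommended values.
    """
    n_cells = defaultdict(int)  # cells per cluster
    n_high = defaultdict(int)   # high-scoring cells per cluster
    for score, cluster in zip(scores, clusters):
        n_cells[cluster] += 1
        if score > score_cutoff:
            n_high[cluster] += 1
    return {c for c in n_cells if n_high[c] / n_cells[c] >= min_fraction}

# Toy data: per-cell doublet scores (computed within each batch) and the
# cluster labels obtained after batch correction.
scores   = [0.1, 0.2, 0.1, 5.0, 4.0, 0.3, 0.2, 0.4]
clusters = ["A", "A", "A", "B", "B", "B", "C", "C"]

flagged = flag_doublet_clusters(scores, clusters, score_cutoff=1.0)

# Every cell in a flagged cluster is marked, including the low-scoring
# cell in cluster "B" that a per-cell threshold would have missed.
is_doublet = [c in flagged for c in clusters]
```

Note that the cell with score 0.3 in cluster "B" is marked only because it sits with high-scoring neighbours, which is exactly what would be lost if high-scoring cells were removed before clustering.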

From a workflow perspective, doublets are of such low frequency that leaving them in will probably not do much harm. In addition, they are fairly well behaved as sequencing libraries go (e.g., high library sizes, lots of detected genes) and their expression profiles are, by definition, within the range of observed expression profiles in the population (e.g., you won't get different HVGs during feature selection). This is unlike, say, low-quality libraries that could really interfere with your normalization, feature selection, PCA, etc.

6.8 years ago • Aaron Lun ★ 29k

0

Thank you Aaron, your approach looks really useful and makes sense. I was thinking the same but just wanted to be sure. After computing doublet scores for each data set, I will merge the scores into a single vector (with the same length as the number of cells), assign it to the corrected SingleCellExperiment object, and use t-SNE to examine the clusters.

For a single data set, is there a recommended threshold for doublet scores, or is using NMADS an option as usual?

Thank you in advance.

6.8 years ago • hamza_karakurt ▴ 60

2

I have recently been through this for a set of many 10X samples (what I ended up doing is shown here).

In essence, I first calculated the scores and called doublets within samples, then performed another round of calling across all samples to identify where I had missed calls in individual samples. Or, in more depth:

1. Get scores separately within each sample.
2. Calculate clusters within each sample (I had to really cluster finely to properly separate the doublet clusters, by the way).
3. Call doublet clusters in each sample (e.g. by identifying outlying clusters with high median doublet score). Label all cells in the doublet clusters as doublets.
4. Batch correct all samples together.
5. Cluster within the all-sample corrected data.
6. Identify all-sample clusters that contain a disproportionately high number of cells that were called as doublets in their own samples; label all cells in these clusters as doublets. This is the across-sample sweep step.

This is shown with figures in the HTML file linked above. There are some things I would change in retrospect (e.g. using NMADS as you say). I note that the difficulty of clustering and identifying doublets will depend a lot on how different the cell types actually are in your data (e.g. I suspect adult tissue would be easier than my embryonic samples). I would also recommend visualising your scores and clusters (e.g. on a t-SNE plot) all the way through to make sure nothing crazy is happening.
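As a rough illustration of the within-sample calling step with an NMADS-style cutoff, here is a minimal sketch in plain Python. It is not the code from the linked HTML file: the function name `call_doublet_clusters`, the default of 3 MADs, and the 1.4826 scaling constant (the default used by R's mad()) are assumptions made for this example, and the scores are toy numbers.

```python
from statistics import median

def call_doublet_clusters(scores, clusters, nmads=3.0):
    """Return clusters whose median doublet score lies more than `nmads`
    scaled MADs above the median of all cluster medians (one sample only).

    The `nmads` default and the 1.4826 constant are assumptions.
    """
    # Group per-cell scores by cluster label.
    by_cluster = {}
    for score, cluster in zip(scores, clusters):
        by_cluster.setdefault(cluster, []).append(score)

    # Median doublet score of each cluster.
    cluster_medians = {c: median(v) for c, v in by_cluster.items()}

    # Upper outlier threshold across the cluster medians.
    centre = median(cluster_medians.values())
    mad = 1.4826 * median(abs(m - centre) for m in cluster_medians.values())
    cutoff = centre + nmads * mad

    return {c for c, m in cluster_medians.items() if m > cutoff}

# Toy data for a single sample: five fine-grained clusters, one of which
# ("E") has a much higher median doublet score than the rest.
scores   = [0.1, 0.2, 0.3, 0.3, 0.3, 0.2, 0.3, 0.3, 0.4, 4.0, 5.0, 6.0]
clusters = ["A", "A", "A", "B", "B", "C", "C", "D", "D", "E", "E", "E"]

doublet_clusters = call_doublet_clusters(scores, clusters)
called = [c in doublet_clusters for c in clusters]
```

With very few clusters the MAD can be zero, in which case any cluster above the centre would be called, so it is worth checking the cluster medians by eye. The per-cell calls in `called` are what would then feed into the across-sample sweep.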

I hope this is useful!

6.8 years ago • Jonathan Griffiths ▴ 90
