5 months ago by
Cambridge, United Kingdom
There are a number of ways to do this, but in all cases, you should be computing doublet scores within each batch. It is obviously impossible to get a doublet consisting of cells from different batches! My favored approach is to:
- Compute doublet scores within each batch, but do not remove the high-scoring cells.
- Do the batch correction with all cells.
- Mark clusters as doublets if they contain many cells with high doublet scores.
This is motivated by the fact that not all doublets will be assigned high doublet scores. (This is simply a consequence of the assumptions that are necessary to get doubletCells to work; see comments here.) By leaving in the doublets, we can use "guilt by association" to identify the cluster of doublet cells. If we removed all cells with high doublet scores beforehand, we would not be able to detect these troublesome clusters, as all of the remaining doublets would have low scores.
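The "guilt by association" step can be sketched roughly as follows. This is only an illustration in plain Python, not the doubletCells API: the function name `flag_doublet_clusters`, the score cutoff and the `min_fraction` threshold are all assumptions made for the example (the approach above only says "many cells with high doublet scores"), and the scores are assumed to have already been computed within each batch.

```python
from collections import defaultdict


def flag_doublet_clusters(scores, clusters, score_cutoff, min_fraction=0.5):
    """Return the set of clusters where at least `min_fraction` of the cells
    have a doublet score above `score_cutoff`.

    Hypothetical helper: both thresholds are illustrative choices,
    not values recommended by any package.
    """
    total = defaultdict(int)  # number of cells per cluster
    high = defaultdict(int)   # number of high-scoring cells per cluster
    for score, cluster in zip(scores, clusters):
        total[cluster] += 1
        if score > score_cutoff:
            high[cluster] += 1
    return {c for c in total if high[c] / total[c] >= min_fraction}


# Made-up per-cell scores (computed within each batch beforehand) and the
# cluster labels obtained after batch correction with all cells.
scores = [0.1, 0.2, 0.1, 5.0, 4.2, 0.3, 6.1, 0.2]
clusters = ["A", "A", "A", "B", "B", "B", "B", "C"]

flagged = flag_doublet_clusters(scores, clusters, score_cutoff=3.0)

# Every cell in a flagged cluster is treated as a putative doublet,
# including cells whose own score is low.
putative_doublets = [i for i, c in enumerate(clusters) if c in flagged]
print(flagged, putative_doublets)
```

In this toy input, the cell with score 0.3 sits in cluster B alongside three high-scoring cells, so it is marked along with them. Had the high-scoring cells been removed first, that cell would have been left behind with nothing to identify it.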
From a workflow perspective, doublets are of such low frequency that leaving them in will probably not do much harm. In addition, they are fairly well behaved as sequencing libraries go (e.g., high library sizes, lots of detected genes) and their expression profiles are, by definition, within the range of observed expression profiles in the population (e.g., you won't get different HVGs during feature selection). This is unlike, say, low-quality libraries that could really interfere with your normalization, feature selection, PCA, etc.
modified 5 months ago
5 months ago by
Aaron Lun • 24k