Question: Preprocessing in scran::doubletCells
Angelos Armen wrote, 11 weeks ago:

Hi,

In the Detecting doublets in single-cell RNA-seq data vignette, scran::computeSumFactors and scran::denoisePCA are used to compute size factors and choose the number of PCs, respectively, before using scran::doubletCluster to detect doublet clusters. However, scran::doubletCells is called with default values for parameters size.factors.norm and d, so library-size normalisation is performed and the top 50 PCs are selected. Is there a reason that (a) the size factors from computeSumFactors are not used instead in the vignette and (b) doubletCells doesn't use denoisePCA internally?

Aaron Lun (Cambridge, United Kingdom) wrote, 11 weeks ago:

Is there a reason that (a) the size factors from computeSumFactors are not used instead in the vignette

Well, actually, calling doubletCells() on a SingleCellExperiment object will use the size factors in that object (see what happens when size.factors.norm=NA). So they should get used, otherwise it's a bug.

That being said, it's worth reading the background about normalization in the doublet detection context. Both the library size factors and the deconvolution size factors are poor proxies for the true RNA content of each cell, which is the more important scaling factor as it determines the ratio in which cells should be mixed together to create simulated doublets. So it doesn't really matter which exact set of size factors you use, as they're both meh.

(b) doubletCells doesn't use denoisePCA internally?

This was just because it was easier to implement - calling denoisePCA() would require passing the technical noise estimates, which ruins the nice easy one-liner that you can currently do. Besides, the function still worked without requiring a more careful choice of features and PCs, so I just went with the simpler thing. If you want, you can force the internal PCA to be like denoisePCA() by setting subset.row= to only those genes with positive biological components and setting d= to the number of PCs returned by denoisePCA().

I would be mildly surprised if this introduced a significant difference in the results... but only mildly, because the doublet-detection-by-simulation strategy is pretty fragile in the first place (due to the normalization issues).

P.S. If it sounds like I'm pretty negative about doubletCells(), well, I am. (And I'm also just a negative guy in general.) There are some strong assumptions about RNA content that need to be made when simulating doublets - here, and possibly also in similar functions by other authors - which makes me take the results of doublet calls based on the simulations with a tablespoon of salt. It is to my surprise that the function seems to work and give sensible results for me and my colleagues - but if you want something that you can trust, I would say that you'll need experimental data, e.g., cell hashing, multiplexed genotypes, or known biological impossibilities of co-expression. Without this... well, you won't get what you don't pay for.

Thank you for the detailed reply Aaron. The problem with cell hashing and multiplexed genotypes is that they can only detect inter-sample doublets; intra-sample doublets go undetected. Therefore computational approaches are still needed.

intra-sample doublets go undetected

Ah - but if you play your cards right, you can have your cake and eat it too. As discussed in my workflow, the key thing is to not remove the experimentally-determined doublets at the start. Let the doublets remain and get assigned into clusters. Then, any clusters that contain lots of inter-sample doublets will probably contain a lot of intra-sample doublets as well. This allows you to use "guilt by association" to get rid of all the doublets by flagging entire clusters as being problematic.

In fact, a bit of math allows you to determine the expected proportion of doublets that are detected as inter-sample doublets. Assume that you have n multiplexed samples with the same number of cells. Assuming doublets form equally across all cells and samples, this means that 1/n of your doublets will be intra-sample and undetectable, leaving (n-1)/n that are inter-sample and detectable. Thus, any cluster where a fraction (n-1)/n of the cells are marked as inter-sample doublets is likely to be a cluster entirely full of doublets. If the fraction is lower than that, you probably have a doublet cluster merged with a non-doublet cluster. You could probably adapt this math to account for unequal sample sizes and different rates of doublet formation between cell types, but the general principle remains.
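
As a rough sketch of that arithmetic (in Python rather than R, and not part of scran), the expected detectable fraction can be computed for both equal and unequal sample sizes. The function names are hypothetical, and the calculation assumes that doublets form by random pairing of cells in proportion to each sample's share of the pool, which is only an approximation:

```python
def expected_inter_sample_fraction(sample_sizes):
    """Expected fraction of doublets whose two cells come from different samples.

    Assumes random pairing: each cell of a doublet is drawn from sample i
    with probability p_i = n_i / N. The intra-sample fraction is then
    sum(p_i^2), which reduces to 1/n for n equally sized samples.
    """
    total = sum(sample_sizes)
    if total <= 0:
        raise ValueError("sample_sizes must contain at least one cell")
    proportions = [size / total for size in sample_sizes]
    return 1.0 - sum(p * p for p in proportions)


def estimated_doublet_fraction(observed_inter_fraction, sample_sizes):
    """Scale the observed inter-sample doublet fraction of a cluster up to
    an estimate of its total doublet fraction (capped at 1)."""
    expected = expected_inter_sample_fraction(sample_sizes)
    if expected == 0:
        raise ValueError("a single sample gives no detectable doublets")
    return min(1.0, observed_inter_fraction / expected)


# Four equally sized samples: (n-1)/n = 3/4 of doublets are detectable.
equal = expected_inter_sample_fraction([500, 500, 500, 500])

# Unequal samples: far fewer doublets are detectable.
unequal = expected_inter_sample_fraction([900, 100])
```

With very unequal samples, most doublets are intra-sample, so a cluster can consist entirely of doublets while only a small fraction of its cells is labelled as such.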

Yes, that makes sense. I was thinking of cell hashing used merely to multiplex samples from different donors and/or conditions. In that case, inter-sample and intra-sample doublets would form separate clusters (as cells would cluster by sample) and your strategy wouldn't be applicable. To take full advantage of cell hashing, each sample would have to be split and multiplexed as well.

By the way, would you recommend recomputing size factors with computeSumFactors after doublet removal? It seems to me that doublets don't violate the 50% non-DE assumption as they're equivalent to "cells" with size factor equal to the sum of the size factors of their parents. Therefore the size factors of all the other cells should still be valid (but not centred).

To take full advantage of cell hashing, each sample would have to be split and multiplexed as well.

That would certainly help if it increases n (and thus reduces the probability of unlabelled doublets) but should not be essential if you already have multiplexed samples to give you n > 1. I say should, because that assumes that doublets can form between samples as readily as within samples, and you could imagine some situations where this is not the case. (For example, cells from one donor being more willing to bind to cells from the same donor. Or maybe to a different donor, if they're immune cells that recognize other donor cells as foreign and start lysing them.)

Therefore the size factors of all the other cells should still be valid (but not centred).

That's correct.
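
The "valid but not centred" point can be illustrated with a toy calculation (in Python, with made-up size factors; it does not run computeSumFactors or any deconvolution). It treats a doublet as a "cell" whose size factor is the sum of its parents' and shows that dropping it leaves the ratios between the remaining cells unchanged, so only a rescaling to unit mean is needed:

```python
def recentre(size_factors):
    """Rescale size factors to a mean of 1, preserving their ratios."""
    mean = sum(size_factors) / len(size_factors)
    return [s / mean for s in size_factors]


# Hypothetical size factors for four singlets (illustrative values only).
singlets = [0.5, 1.0, 1.5, 2.0]

# A doublet of the first and last cell, treated as a single "cell"
# whose size factor is the sum of its parents' size factors.
doublet = singlets[0] + singlets[3]

# Size factors centred across all five "cells", doublet included.
with_doublet = recentre(singlets + [doublet])

# Remove the doublet: the remaining factors no longer average to 1 ...
after_removal = with_doublet[:-1]
mean_after_removal = sum(after_removal) / len(after_removal)

# ... but recentring recovers what the singlets alone would have given.
recentred = recentre(after_removal)
singlets_only = recentre(singlets)
```

Here `recentred` matches `singlets_only`, while `mean_after_removal` is below 1 because the removed doublet had an above-average size factor.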