19 days ago by
Cambridge, United Kingdom
Is there a reason that (a) the size factors from computeSumFactors are not used instead in the vignette
Well, actually, calling doubletCells() on a SingleCellExperiment object will use the size factors in that object (see what happens when size.factors.norm=NA). So they should get used, otherwise it's a bug.
That being said, it's worth reading the background about normalization in the doublet detection context. Both the library size factors and the deconvolution size factors are poor proxies for the true RNA content of each cell, which is the more important scaling factor as it determines the ratio in which cells should be mixed together to create simulated doublets. So it doesn't really matter which exact set of size factors you use, as they're both meh.
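To make the mixing-ratio point concrete, here is a rough base R sketch of simulating doublets from a toy count matrix. It is not the doubletCells() implementation; the helper mix_cells() and the rna.content vector are hypothetical names made up for illustration, and the RNA content values are invented because that quantity is usually unknown.

```r
set.seed(100)

# Toy count matrix: 50 genes (rows) by 20 cells (columns), purely simulated.
counts <- matrix(rpois(50 * 20, lambda = 5), nrow = 50, ncol = 20)

# Library size factors, scaled to have a mean of 1 across cells.
lib.sf <- colSums(counts) / mean(colSums(counts))

# Hypothetical RNA content per cell (an assumption, not something we measured):
# pretend the last 10 cells carry three times as much RNA as the first 10.
rna.content <- rep(c(1, 3), each = 10)

# Hypothetical helper: put every cell on a common scale with 'norm.sf',
# re-weight each cell by its assumed RNA content, then add the chosen pairs.
mix_cells <- function(counts, norm.sf, content, first, second) {
    profiles <- sweep(counts, 2, norm.sf, "/")   # remove per-cell scaling
    weighted <- sweep(profiles, 2, content, "*") # restore assumed RNA content
    weighted[, first, drop = FALSE] + weighted[, second, drop = FALSE]
}

# Pick random pairs of cells and mix them into simulated doublets.
n.sim <- 100
first <- sample(ncol(counts), n.sim, replace = TRUE)
second <- sample(ncol(counts), n.sim, replace = TRUE)
sim <- mix_cells(counts, lib.sf, rna.content, first, second)
```

Every simulated profile in sim depends on rna.content, so a wrong guess about RNA content shifts all of the simulated doublets, whichever set of size factors was used for the normalization step.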
(b) doubletCells doesn't use denoisePCA internally?
This was just because it was easier to implement - calling denoisePCA() would require passing the technical noise estimates, which ruins the nice easy one-liner that you can currently do. Besides, the function still worked without requiring a more careful choice of features and PCs, so I just went with the simpler thing. If you want, you can force the internal PCA to be like denoisePCA() by setting subset.row= to only those genes with positive biological components and setting d= to the number of PCs returned by denoisePCA().
I would be mildly surprised if this introduced a significant difference in the results... but only mildly, because the doublet-detection-by-simulation strategy is pretty fragile in the first place (due to the normalization issues).
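For anyone wondering what "positive biological components" and "number of PCs" amount to, here is a crude base R sketch on simulated data. It does not call scran at all: the loess trend is a stand-in for a proper technical noise estimate, and the rule for picking d (keep PCs until they explain the total biological variance) is my simplified reading of the denoising idea, so treat both as assumptions. The names keep and d are just the values you would then pass as subset.row= and d=.

```r
set.seed(42)

# Toy log-expression matrix: 200 genes (rows) by 60 cells (columns).
ngenes <- 200
ncells <- 60
gene.base <- runif(ngenes, 1, 5)
logexpr <- matrix(rnorm(ngenes * ncells, mean = rep(gene.base, ncells)),
                  nrow = ngenes, ncol = ncells)

# Add a two-group difference to the first 20 genes to mimic biology.
group <- rep(c(0, 1), each = ncells / 2)
logexpr[1:20, ] <- logexpr[1:20, ] + outer(rep(2, 20), group)

# Crude variance decomposition: the fitted mean-variance trend plays the
# role of the technical component, and the residual is the biological one.
gene.mean <- rowMeans(logexpr)
gene.var <- apply(logexpr, 1, var)
trend <- loess(gene.var ~ gene.mean)
bio <- gene.var - fitted(trend)

# Genes with a positive biological component (what subset.row= would get).
keep <- which(bio > 0)

# PCA on those genes only, with cells as observations.
pca <- prcomp(t(logexpr[keep, , drop = FALSE]))
pc.var <- pca$sdev^2

# Keep the smallest number of PCs whose variance reaches the total
# biological variance of the selected genes (what d= would get).
d <- which(cumsum(pc.var) >= sum(bio[keep]))[1]
pcs <- pca$x[, seq_len(d), drop = FALSE]
```

The point is only that both knobs come from the same variance decomposition, so restricting the genes and truncating the PCs are two halves of one choice.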
P.S. If it sounds like I'm pretty negative about doubletCells(), well, I am. (And I'm also just a negative guy in general.) There are some strong assumptions about RNA content that need to be made when simulating doublets - here, and possibly also in similar functions by other authors - which make me take the results of doublet calls based on the simulations with a tablespoon of salt. To my surprise, the function seems to work and give sensible results for me and my colleagues - but if you want something that you can trust, I would say that you'll need experimental data, e.g., cell hashing, multiplexed genotypes, or known biological impossibilities of co-expression. Without this... well, you won't get what you don't pay for.
modified 19 days ago
19 days ago by
Aaron Lun • 24k