22 months ago by
Cambridge, United Kingdom
That's an interesting question. The minimum number of cells required for deconvolution probably depends on the quality of each cell: if you don't have a lot of zeroes in each cell, or if the number of zeroes doesn't vary much across cells, then I would expect that you don't need to pool as many cells to get accurate normalization. In fact, if you have high-quality libraries, then bulk-based methods that operate on each cell separately will actually do okay. For very low-count data (e.g., inDrop, Zeisel et al.'s brain data), the deconvolution approach works with as few as 100 cells per cluster, but I've noticed that the precision of the estimates starts to deteriorate as the number of cells decreases. This makes sense, because the method works by sharing information across cells, and there's less information when there are fewer cells.
If your data is like the low-count data mentioned above, then 90 cells will probably be borderline (and definitely not enough if they're split into three subpopulations, such that you'd only have 30 cells on average in each subpopulation). In such cases, I would try running computeSumFactors without any clustering and hope that you don't have large numbers of DE genes between your subpopulations. On the other hand, if your data has higher coverage, then you might be able to get away with fewer cells (e.g., set sizes=c(5, 10, 15, 20), assuming your subpopulations are around 30 cells each; you'll have to turn down the minimum cluster size in quickCluster as well, or define your clusters manually). It's worth a shot, at least. It's not like it would do worse than standard normalization methods, so you might as well give it a go.
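As a rough sketch of the second option, something like the following could be tried. The counts here are simulated purely for illustration (90 cells in three made-up groups of 30), and the cluster labels are defined manually rather than by quickCluster. The object names (sce, clusters, cell.bias) are placeholders, not anything from your data.

```r
library(scran)
library(SingleCellExperiment)

set.seed(100)
ngenes <- 2000
ncells <- 90

# Simulated, moderately high-coverage counts (an assumption for this sketch):
# each cell gets its own scaling bias, each gene its own mean.
cell.bias <- rlnorm(ncells, sdlog = 0.5)
gene.mean <- rgamma(ngenes, shape = 2, rate = 0.2)
counts <- matrix(
    rnbinom(ngenes * ncells, mu = outer(gene.mean, cell.bias), size = 10),
    nrow = ngenes
)
sce <- SingleCellExperiment(assays = list(counts = counts))

# Manually defined clusters of 30 cells each (hypothetical labels).
# Alternatively, something like quickCluster(sce, min.size = 20) might work,
# but I haven't checked how it behaves on so few cells.
clusters <- factor(rep(c("A", "B", "C"), each = 30))

# Pool sizes must stay below the smallest cluster size, hence 5 to 20.
sce <- computeSumFactors(sce, sizes = c(5, 10, 15, 20), clusters = clusters)
summary(sizeFactors(sce))
```

To try the first option instead (no clustering at all), drop the clusters argument so that all 90 cells are pooled together.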
In any case, I always plot the deconvolution size factors against the library sizes, just as a sanity check. For low-count data, these two methods are the most similar (relative to DESeq and TMM normalization, which give distorted estimates), so it's generally a good idea to check that their estimates are roughly correlated. Of course, some scatter in this plot is expected, as the two strategies will differ when there is DE between cells (against which library size normalization is not robust).
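A minimal version of that sanity check in base R might look like the sketch below. The helper compare_to_libsize is a made-up name, and in the demo the "deconvolution" size factors are just a stand-in (the simulated per-cell biases), since the point is only to show the plot. With real data you would pass the count matrix and the output of sizeFactors() after running computeSumFactors.

```r
# Hypothetical helper: plot size factors against library size-derived
# factors on a log-log scale and return their log-scale correlation.
compare_to_libsize <- function(counts, sf) {
    lib.sf <- colSums(counts)
    lib.sf <- lib.sf / mean(lib.sf)  # centre at unity, like size factors
    plot(lib.sf, sf, log = "xy",
         xlab = "Library size factor", ylab = "Deconvolution size factor")
    abline(0, 1, col = "red", lty = 2)  # y = x on the log-log axes
    cor(log(lib.sf), log(sf))
}

# Demo on simulated counts (illustration only).
set.seed(42)
ngenes <- 1000
ncells <- 90
cell.bias <- rlnorm(ncells, sdlog = 0.5)
gene.mean <- rgamma(ngenes, shape = 2, rate = 0.2)
counts <- matrix(
    rpois(ngenes * ncells, lambda = outer(gene.mean, cell.bias)),
    nrow = ngenes
)

# Stand-in for deconvolution size factors: the true simulated biases.
stand.in.sf <- cell.bias / mean(cell.bias)
r <- compare_to_libsize(counts, stand.in.sf)
r
```

There is no DE in this simulation, so the points sit close to the diagonal. Real data with DE between subpopulations should show more scatter, but a complete lack of correlation would be a warning sign.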
Also, I don't know how nicely the deconvolution method plays with Kallisto-derived counts. In theory, it shouldn't matter if the values are interpretable as counts, but I haven't checked it out in practice.
modified 22 months ago
22 months ago by
Aaron Lun • 18k