Hi, I'm working with a mostly homogeneous population of cultured lung cells infected with influenza virus (10X Genomics Chromium single cell). After filtering, I have roughly 5K infected cells, 4K exposed but uninfected cells, and 4K unexposed cells. Basically, I'd like to know whether the pre-clustering step for scran normalization is still appropriate in my case, or whether this step was mostly intended for heterogeneous groups of cells (e.g. from tissue). I'm mostly interested in investigating differences in the host cell response, and I'm concerned (or perhaps a little confused) about how the pre-clustering step would affect downstream analyses such as DGE using viral factors (there is heterogeneity at the virus level in how many virus genes are present/functional, etc). Any suggestions/recommendations would be greatly appreciated! Thanks, Cris
The clusters are designed to mitigate the effect of DE between subpopulations when computing size factors. If you definitely have no subpopulations in your data, then the clusters won't harm or help. However, if you do have subpopulations, then the clustering will improve the accuracy of the size factor calculations.
Keep in mind that we're not just removing differences within clusters, we are also removing systematic differences between clusters. From a conceptual level, if you're trying to make A, B, C and D equal, the order in which you remove differences doesn't matter. You could make A and B equal first, then C and D, and then make A/B equal to C/D. Or you could do A and B, then make A/B equal to C, then A/B/C equal to D. In the end, everything is equal so there's no problem(*).
This is unlike other applications, such as imputation, where the algorithm aims to remove variation within some structures while preserving differences between structures. In such cases, clustering can introduce artificial structure or make weak structure look more convincing than it really is.
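The "order doesn't matter" argument for A, B, C and D can be checked with a toy example. This is only a sketch of the idea, not what scran does internally: the `merge` helper, the baseline and the per-cell biases are all made up for illustration, and each "cell" is reduced to a single number.

```python
import math

# Hypothetical toy values: one shared baseline times a made-up bias per cell.
baseline = 10.0
biases = {"A": 0.5, "B": 1.0, "C": 2.0, "D": 4.0}
observed = {cell: baseline * bias for cell, bias in biases.items()}

def merge(values, left, right):
    """Rescale every cell in `right` to match `left`, then return the merged group.

    Assumes each group is already internally equal, so its first cell
    can stand in for the whole group.
    """
    scale = values[left[0]] / values[right[0]]
    for cell in right:
        values[cell] *= scale
    return left + right

# Order 1: equalize (A, B), then (C, D), then A/B with C/D.
v1 = dict(observed)
ab = merge(v1, ["A"], ["B"])
cd = merge(v1, ["C"], ["D"])
merge(v1, ab, cd)

# Order 2: equalize (A, B), then add C, then add D.
v2 = dict(observed)
ab = merge(v2, ["A"], ["B"])
abc = merge(v2, ab, ["C"])
merge(v2, abc, ["D"])

# Both orders should leave every cell at the same final value.
same = all(math.isclose(v1[c], v2[c]) for c in observed)
print(same)
```

With noise-free numbers like these, both merge orders end at the same values. With real counts each step carries estimation error, which is where the choice of grouping starts to matter for accuracy.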
*: In practice, the order in which you do these steps will affect the accuracy of the resulting size factors for various numerical reasons; hence the need for clustering. But provided each step is accurate, the final result should be the same regardless of the clustering (unlike other applications, which are highly sensitive to the initial clustering). You can test this by fiddling with the clustering parameters and checking whether the size factor estimates from the different runs correlate well.
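One way to do that comparison, assuming you have exported the size factors from two runs with different clustering settings, is to correlate them on the log scale. The sketch below is a generic helper, not part of scran; `log_pearson` is a hypothetical name, and the two vectors are placeholder numbers rather than real estimates.

```python
import math

def log_pearson(x, y):
    """Pearson correlation of two positive vectors after log-transformation."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length (at least 2 values)")
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var_x = sum((a - mean_x) ** 2 for a in lx)
    var_y = sum((b - mean_y) ** 2 for b in ly)
    return cov / math.sqrt(var_x * var_y)

# Placeholder size factors for the same five cells under two clustering settings.
# Replace these with the values from your own runs.
sf_run1 = [0.60, 0.90, 1.00, 1.30, 2.10]
sf_run2 = [0.62, 0.88, 1.05, 1.27, 2.00]

r = log_pearson(sf_run1, sf_run2)
print(f"log-scale Pearson r = {r:.3f}")
```

Working on the log scale means a constant rescaling of one run (which does not change the normalization) has no effect on the correlation. A value close to 1 suggests the size factors are not sensitive to the clustering; a scatter plot of the two vectors is a useful companion check.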