5 months ago by
Cambridge, United Kingdom
Can we say they are low quality clusters?
You could say that; I prefer the term "poorly separated". However, "poorly separated" does not mean "useless". For me, clustering is just a way of breaking up the data set into parts that can be comprehended. A useful clustering procedure will deliver parts (i.e., clusters) that relate to biological concepts like cell types or status or whatever, which gives us something concrete to think about when we're trying to understand the data. If you treat clustering as a tool in this manner, then even poorly separated clusters are useful if they break up a big blob of cells into easily digestible chunks.
You might think that poorly separated clusters are less likely to be "real". But clusters are inherently empirically defined, so there's not much meaning to discussing whether they are real or not. The more pertinent question is whether two poorly separated clusters are derived from the same underlying biological aspect or whether they correspond to different aspects, e.g., different cell types. Answering this question requires some strong assumptions about how these aspects manifest in the data, like "cells are normally distributed around an average expression profile for each cell type" (the assumption implicit in, e.g., k-means). I do not find this line of investigation to be particularly interesting, and would rather spend my efforts on achieving a useful clustering that yields some hypotheses for experimental validation.
You might also think that poorly separated clusters are less stable in the sense that, under slightly different circumstances, they will merge with neighbouring clusters. This is a valid concern, but more from a perspective of logistics - it's annoying to have to re-find a poorly separated cluster if it keeps on merging with its neighbours every time you change a parameter in the upstream steps of your analysis. However, the cluster hasn't "disappeared" - the cells that make it up are still there, it's just the way you're summarizing the data that has changed.
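To make the "re-finding" point concrete, here is a minimal Python sketch (not taken from any workflow; the function name `best_match` and the toy cell names are made up for illustration). It assumes you have the set of cells from a cluster in an earlier run and a cell-to-label mapping from a new run, and it reports which new cluster overlaps most with the old one using the Jaccard index.

```python
def best_match(old_cells, new_labels):
    """Find the new cluster that best overlaps an old cluster.

    old_cells:  set of cell names from a cluster in the earlier run.
    new_labels: dict mapping cell name -> cluster label in the new run.
    Returns (label, jaccard) for the best-overlapping new cluster.
    """
    # Group the cells of the new run by their cluster label.
    members = {}
    for cell, label in new_labels.items():
        members.setdefault(label, set()).add(cell)

    best_label, best_score = None, 0.0
    for label, cells in members.items():
        # Jaccard index: shared cells divided by all cells in either cluster.
        score = len(old_cells & cells) / len(old_cells | cells)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical toy data: the old cluster has been absorbed into a larger "A".
old_cluster = {"c1", "c2", "c3"}
new_labels = {"c1": "A", "c2": "A", "c3": "A", "c4": "A", "c5": "A", "c6": "B"}

label, score = best_match(old_cluster, new_labels)
still_present = all(cell in new_labels for cell in old_cluster)
print(label, score, still_present)
```

In this toy case the old cluster's cells all end up inside the larger cluster "A": the Jaccard index drops below 1 because of the merge, but every cell is still present in the new labelling.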
So, sure, I would be more inclined to work on well-separated clusters, but only because it's easier. There is usually some important biology that occurs in poorly separated clusters, so it would be silly to dismiss them out of hand.
Can we use the log-counts of data from different conditions (such as healthy and disease) for differential expression analysis before we use MNN?
Yes. See for example here for a pseudo-bulk analysis. The workflows here demonstrate how to do this for each cell type after clustering on the MNN-corrected values. (Note that only the clustering is done on the corrected values; the DE analysis is done on the original counts!)
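As a rough illustration of that split, here is a small Python sketch of the pseudo-bulk aggregation step. It is not the code from the linked workflows; the function name `pseudo_bulk`, the data layout, and the toy values are all assumptions made for the example. The cluster labels are assumed to come from clustering on the corrected values, while the numbers being summed are the uncorrected counts.

```python
from collections import defaultdict


def pseudo_bulk(counts, samples, clusters):
    """Sum raw counts over all cells sharing a (cluster, sample) pair.

    counts:   dict cell -> dict gene -> raw (uncorrected) count.
    samples:  dict cell -> sample of origin (e.g., one healthy donor).
    clusters: dict cell -> cluster label, assumed to have been obtained
              by clustering on the batch-corrected values.
    Returns dict (cluster, sample) -> dict gene -> summed count.
    """
    out = defaultdict(lambda: defaultdict(int))
    for cell, gene_counts in counts.items():
        key = (clusters[cell], samples[cell])
        for gene, n in gene_counts.items():
            out[key][gene] += n
    return {key: dict(genes) for key, genes in out.items()}


# Hypothetical toy data: four cells, two genes, two samples, two clusters.
counts = {
    "c1": {"GeneA": 3, "GeneB": 0},
    "c2": {"GeneA": 1, "GeneB": 2},
    "c3": {"GeneA": 0, "GeneB": 5},
    "c4": {"GeneA": 2, "GeneB": 1},
}
samples = {"c1": "healthy_1", "c2": "healthy_1",
           "c3": "disease_1", "c4": "disease_1"}
clusters = {"c1": "1", "c2": "1", "c3": "1", "c4": "2"}

bulk = pseudo_bulk(counts, samples, clusters)
for key in sorted(bulk):
    print(key, bulk[key])
```

Each (cluster, sample) profile can then be treated like a bulk sample, and the healthy and disease profiles compared within each cluster. The corrected values are only used to decide which cells belong together, never as input to the DE test itself.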
modified 5 months ago
5 months ago by
Aaron Lun • 24k