This time I am trying to figure out how I can determine whether I have obtained the optimal number of clusters using the buildSNNGraph and igraph approach.
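For reference, the workflow I mean is along these lines (a minimal sketch; `sce` is assumed to be a SingleCellExperiment with a "PCA" reducedDim already computed, and the parameter values are just illustrative):

```r
library(scran)
library(igraph)

# Build a shared nearest-neighbor graph from the PCA coordinates.
g <- buildSNNGraph(sce, k=10, use.dimred="PCA")

# Cluster the graph; Walktrap is one common choice of community detection.
clusters <- cluster_walktrap(g)$membership
table(clusters)
```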
My apartment has an amazing view that gives me a panorama across San Francisco. If I stare out the window, I can see the apartment blocks, the major arteries going in and out of the city, the mountain ranges in the distance; if I stare a bit longer, I start to notice the smaller buildings, the main roads, the parks; and if I continue watching, I can make out trees, vehicles, pedestrians and the occasional dog.
Which one of those views is correct? Well, what do you care about? I'm more of a landscape guy, so I like to sit back in the comfort of my living room and enjoy the vista in front of me. ("Everything the light touches...", etc.) But that might not be the case in other contexts. If I had to deliver something, I would need to care about the finer detail of the city's layout, and if I was waiting for someone, I would need to keep an eye on all the people passing by.
Now, I'm not (just) bragging about my apartment here, because this serves as a good metaphor for clustering resolution. Do you care about the major cell types? Do you want to cut them up into subtypes? Do you want to cut them further into different metabolic/stress/activation states? The data doesn't exist in a vacuum, so adjust the resolution until you get clusters that can help you answer your scientific question.
You might feel that's an unsatisfying answer. "If there's no right or wrong clustering, I could just try a whole bunch of different clusterings and choose the one that I like the best," you might say. Well, yes! You can do that! The vast majority of single-cell data analysis is exploratory and, in my opinion, should be aiming to generate hypotheses. It's fine to play fast and loose here to get a piece of useful information from the dataset, as long as you are willing to test the subsequent hypotheses with an independent experiment.
I recently read a methods paper, IKAP, which is built on the Seurat object and Seurat's clustering functions; it identifies the optimal number of PCs and the optimal resolution parameter for clustering using gap statistic values.
At the risk of sounding like a grumpy old man, I would say that the optimal number of clusters is the number of clusters that you and/or your collaborators can be bothered to annotate and/or validate.
Now, when I try to run it, I get an error about memory allocation, as I am trying to run it on my laptop. So I don't even know whether this approach will work.
Good lord, don't try to run it on a real dataset. You will see that my comments on k-means in the book are rather equivocal, because I don't think it's a good frontline clustering method that yields interpretable results. It is, however, a good data compression approach that can be used in conjunction with other approaches to speed them up. (Note that, when I'm talking about speed, I'm referring to kmeans, not clusGap; the latter is only necessary if you were aiming to use k-means as your frontline method.)
So I am wondering if this is a legitimate strategy, or whether I am committing the mistake of comparing apples (graph-based clusters) to oranges (k-means-based clusters).
Most graph-based clustering algorithms have a natural way of choosing the "optimal" number of clusters, via maximization of the modularity score. This is a metric that - well, read the book. Now, I quoted "optimal" above because this may not have much relevance to what is best for your situation. The modularity is an arbitrary measure that has several different definitions, while the nearest-neighbor graph is an artificial construct that depends on some choice of k and some definition of the weighting of shared neighbors.
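To make that concrete, one could compare the modularity achieved at several graph-building parameters (a hedged sketch; `sce` and the choices of k are assumptions, and `modularity()` here is igraph's modularity for a communities object):

```r
library(scran)
library(igraph)

# Modularity of the Walktrap clustering at several choices of k.
# Higher modularity is "better" by this metric - with the caveats above
# that both the metric and the graph itself are somewhat arbitrary.
for (k in c(5, 10, 20, 50)) {
    g <- buildSNNGraph(sce, k=k, use.dimred="PCA")
    cl <- cluster_walktrap(g)
    cat("k =", k, ":", length(unique(cl$membership)), "clusters,",
        "modularity =", modularity(cl), "\n")
}
```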
Thanks Aaron for your comments, and congrats on having a place with such a nice view in SF; hope you can see the Golden Gate Bridge too, with appropriate parameters.
I do get the point that I could increase or decrease the resolution based on my needs. However, I just wanted to know whether there is a way to identify an optimal range of cluster numbers, based on some statistical tool or definition, below (or above) which I am under- (or over-) clustering.
In the dataset I am analyzing, just like in others I have seen from large dataset integrations (for example, the mouse nervous system atlas), the non-neuronal and some specific neuronal populations are already clustered distinctly even at lower resolution. However, to define some distinct cell types that we know exist from experimental biology, within those big blobs (in t-SNE or UMAP) of cells that are labelled as just excitatory or inhibitory neurons, what would be a more algorithmic approach, rather than just saying I increased the resolution until I got the cell types I wanted?
When I plot some known genes, for example those involved in anterior-posterior patterning or those that follow a temporal order during cell state specification, I see that they show a spatial distribution within one cluster; based on my prior knowledge of the biology, I want that cluster to be separated. Since it wasn't, I felt I might be under-clustering. Second, I also want to subcluster the big clusters obtained in step one, to further dissect those big clusters of specific neuronal types into (probably) positional or temporal cell types. But someone could ask: why not just increase the resolution on the full dataset, rather than subclustering? To that, I wanted to argue that we statistically identified those clusters at the gross level and they are "optimal", and hence, to further identify cell types, we sub-clustered the major clusters using a recursive method (probably just two recursive levels). Hope my argument is not too confusing.
Only the Bay bridge, I'm afraid. Couldn't bear to fork out an extra 1K per month to get a similarly good view on the other side.
If you insist, you could use bootstrapCluster() to determine whether your clusters are stable. High coassignment probabilities mean that you're overclustering to the point that the clusters are not robust to sampling noise. However, this is probably less useful than it sounds; if the clusters are so small that they can't even handle sampling noise, the resolution is probably far too high for meaningful interpretation.

Here's another way of looking at it: imagine that you have a perfect assignment of cells to their true types. (For argument's sake, let's work with T cells here.) Now you need to write up the results. What kind of resolution is relevant to you? CD4-positive vs CD8-positive? Memory vs effector? Mature vs immature? Some-combination-of-CD-positives vs the-opposite-combination-of-CD-negatives? This is inherently a human decision - if you can make it, then you can just choose the clustering resolution to match, but if you can't make it, no algorithm can help you.
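Coming back to bootstrapCluster(): a sketch of how that check might look with scran (the FUN wrapper and parameters are illustrative assumptions; check ?bootstrapCluster for the exact interface in your version):

```r
library(scran)

# Cluster assignments as a function of a (bootstrapped) expression matrix.
clusterFUN <- function(x) {
    g <- buildSNNGraph(x, k=10)
    igraph::cluster_walktrap(g)$membership
}

# Coassignment probabilities across bootstrap replicates:
# high off-diagonal values indicate clusters that are not
# robust to sampling noise, i.e., likely overclustering.
coassign <- bootstrapCluster(sce, FUN=clusterFUN)
```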
And... what's wrong with that? I think that's totally fine.
Let me predict what will happen with these algorithmic approaches. You'll try one of them - it may or may not give satisfactory results. You'll then think, "Oh, better check another approach." So you'll try another one and it will most likely give different results to the first, due to differences in its parameter settings or assumptions. Repeat the process with a whole bunch of methods and you'll end up with a whole bunch of different clusterings... from which you will choose one clustering that seems to be the most interesting/sensible/whatever you're looking for.
So if you're going to be "subjective" anyway, you might as well be upfront about it; adjust the resolution directly and save yourself a round-trip through all those algorithms. In fact, I don't think this is any more "subjective" than other day-to-day scientific decisions. For example, why do we summarize expression at the gene level? Why not work at the transcript level? Why not at the exon level? Hell, why not go down to kmer-level counts? The answer is because some of these are more interpretable than others, and that's the same thing with cluster resolution.
Sure you could; there are many ways to cut this cake. However, subclustering does offer the benefit of redefining the set of features (HVGs and PCs) to improve resolution within the big cluster, avoiding noise from irrelevant genes that were only needed to separate the big cluster from the other clusters.
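As a sketch of what that redefinition looks like in practice (function names are from scran/scater; `sce`, `clusters`, and the parameter values are assumptions for illustration):

```r
library(scran)
library(scater)

# Subset to the big cluster of interest, then redo feature selection
# and dimensionality reduction using only those cells.
sub <- sce[, clusters == 1]
dec.sub <- modelGeneVar(sub)
hvgs.sub <- getTopHVGs(dec.sub, n=2000)
sub <- runPCA(sub, subset_row=hvgs.sub)

# Recluster on the subcluster-specific features, free of the genes
# that only distinguished the big cluster from everything else.
g.sub <- buildSNNGraph(sub, use.dimred="PCA")
sub.clusters <- igraph::cluster_walktrap(g.sub)$membership
```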
Hopefully you can see the contradiction here. If you're intending to subcluster, then you're conceding that the first round of clustering is not of sufficiently high resolution to capture the relevant heterogeneity. Which is perfectly acceptable to say and doesn't compromise the analysis at all, but you can't then turn around and claim that the first round was optimal.