Optimal cluster number identification using buildSNNGraph and igraph clusters
p.joshi
@pjoshi-22718
Germany

Hi again!

This time I am trying to figure out how I can determine whether I have obtained the optimal number of clusters using the buildSNNGraph and igraph approach. I can increase and decrease the resolution of the clustering by changing the parameter k, which I know is the number of nearest neighbors and not the number of centroids.
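For reference, the effect of k can be seen directly by counting the clusters obtained at a few settings. A minimal sketch, assuming `sce` is a normalized SingleCellExperiment with a "PCA" entry in its reduced dimensions (higher k gives a denser graph and coarser clusters; lower k gives finer clusters):

```r
library(scran)
library(igraph)

# 'sce' is assumed to be a normalized SingleCellExperiment with PCA computed.
for (k in c(5, 10, 20, 50)) {
    g <- buildSNNGraph(sce, k=k, use.dimred="PCA")
    clust <- igraph::cluster_louvain(g)$membership
    cat("k =", k, "->", length(unique(clust)), "clusters\n")
}
```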

I recently read a methods paper, IKAP, which is based on the Seurat object and Seurat clustering functions; it identifies the optimal number of PCs and the resolution parameter using gap statistic values. As I am not such a good programmer, I don't want to risk modifying its code to fit the scran-based functions.

In the tutorial on the OSCA website, there is an example of finding the optimal k for k-means clustering using the cluster::clusGap() function. I can also supply my own clustering function there, which is defined in another example:

myClusterFUN <- function(x) {
    g <- buildSNNGraph(x, use.dimred="corrected", type="jaccard")
    igraph::cluster_louvain(g)$membership
}

gaps <- clusGap(com.sce, myClusterFUN, K.max=40)


Now, when I try to run this, I get an error about memory allocation, as I am running it on my laptop, so I don't even know whether this approach will work. The other thing I considered was to run k-means clustering to identify the optimal number of clusters in the integrated single-cell dataset, and then play with the resolution parameter until I get roughly that many graph-based clusters.

So I am wondering whether this is a legitimate strategy or whether I am committing the mistake of comparing apples (graph-based clusters) to oranges (k-means clusters).

Thanks!

scran gap statistics single cell clustering
Aaron Lun
@alun
The city by the bay

This time I am trying to figure out how I can determine whether I have obtained the optimal number of clusters using the buildSNNGraph and igraph approach.

My apartment has an amazing view that gives me a panorama across San Francisco. If I stare out the window, I can see the apartment blocks, the major arteries going in and out of the city, the mountain ranges in the distance; if I stare a bit longer, I start to notice the smaller buildings, the main roads, the parks; and if I continue watching, I can make out trees, vehicles, pedestrians and the occasional dog.

Which one of those views is correct? Well, what do you care about? I'm more of a landscape guy, so I like to sit back in the comfort of my living room and enjoy the vista in front of me. ("Everything the light touches...", etc.) But that might not be the case in other contexts. If I had to deliver something, I would need to care about the finer detail of the city's layout, and if I was waiting for someone, I would need to keep an eye on all the people passing by.

Now, I'm not (just) bragging about my apartment here, because this serves as a good metaphor for clustering resolution. Do you care about the major cell types? Do you want to cut them up into subtypes? Do you want to cut them further into different metabolic/stress/activation states? The data doesn't exist in a vacuum, so adjust the resolution until you get clusters that can help you answer your scientific question.

You might feel that's an unsatisfying answer. "If there's no right or wrong clustering, I could just try a whole bunch of different clusterings and choose the one that I like the best," you might say. Well, yes! You can do that! The vast majority of single-cell data analysis is exploratory and, in my opinion, should be aiming to generate hypotheses. It's fine to play fast and loose here to get a piece of useful information from the dataset, as long as you are willing to test the subsequent hypotheses with an independent experiment.

I recently read a methods paper, IKAP, which is based on the Seurat object and Seurat clustering functions; it identifies the optimal number of PCs and the resolution parameter using gap statistic values.

At the risk of sounding like a grumpy old man, I would say that the optimal number of clusters is the number of clusters that you and/or your collaborators can be bothered to annotate and/or validate.

Now, when I try to run this, I get an error about memory allocation, as I am running it on my laptop, so I don't even know whether this approach will work.

Good lord, don't try to run it on a real dataset. You will see that my comments on k-means in the book are rather equivocal, because I don't think it's a good frontline clustering method that yields interpretable results. It is, however, a good data compression approach that can be used in conjunction with other approaches to speed them up. (Note that, when I'm talking about speed, I'm referring to kmeans not clusGap; the latter is only necessary if you were aiming to use k-means as your frontline method.)

So I am wondering whether this is a legitimate strategy or whether I am committing the mistake of comparing apples (graph-based clusters) to oranges (k-means clusters).

Most graph-based clustering algorithms have a natural way of choosing the "optimal" number of clusters, via maximization of the modularity score. This is a metric that - well, read the book. Now, I quoted "optimal" above because this may not have much relevance to what is best for your situation. The modularity is an arbitrary measure that has several different definitions, while the nearest neighbor graph is an artificial construct that depends on some choice of k and some definition of the weighting of shared neighbors.
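To illustrate, the achieved modularity of a given graph clustering can be inspected directly. A sketch assuming `sce` holds the corrected dimensions from the original post:

```r
library(scran)
library(igraph)

# 'sce' assumed to be a SingleCellExperiment with a "corrected" reduced dimension.
g <- buildSNNGraph(sce, use.dimred="corrected", type="jaccard")
clust <- igraph::cluster_louvain(g)

# Louvain maximizes modularity internally; the achieved value can be inspected:
igraph::modularity(clust)

# Per-cluster-pair contributions (observed/expected edge weight ratios):
ratio <- scran::clusterModularity(g, clust$membership, as.ratio=TRUE)
```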


Thanks Aaron for your comments, and congrats on having an apartment with a nice view in SF; I hope you can see the Golden Gate Bridge too, with appropriate parameters.

I do get the point that I could increase or decrease the resolution based on my needs; however, I just wanted to know whether there is a way to identify an optimal range of clusters based on some statistical tool or definition, below (or above) which I am under- (or over-) clustering.

In the dataset I am analyzing, just like others I have seen for large dataset integrations (for example, the mouse nervous system atlas), the non-neuronal and some specific neuronal populations are already clustered distinctly even at low resolution. However, to define distinct cell types, which we know exist based on experimental biology, within those big blobs (in t-SNE or UMAP) of cells labelled as just excitatory or inhibitory neurons, what would be a more algorithmic approach, rather than just saying I increased the resolution until I got the cell types I wanted?

When I plot some known genes, for example ones involved in anterior-posterior patterning or ones that follow a temporal order during cell state specification, I see that they show a spatial distribution within one cluster. Based on my prior knowledge of the biology, I want that cluster to be split; since it wasn't, I felt maybe I am underclustering. Second, I also want to subcluster the big clusters obtained in step one, to further dissect those large clusters of specific neuronal types into, probably, positional or temporal cell types. But someone could ask, why didn't I just increase the resolution on the big cluster itself, rather than subclustering? For that, I wanted to argue that we statistically identified those clusters at the gross level and they are "optimal"; hence, to further identify cell types, we subclustered the major clusters using a recursive method (probably just two recursive levels). I hope my argument is not too confusing.


Thanks Aaron for your comments, and congrats on having an apartment with a nice view in SF; I hope you can see the Golden Gate Bridge too, with appropriate parameters.

Only the Bay Bridge, I'm afraid. Couldn't bear to fork out an extra 1K per month to get a similarly good view on the other side.

however, I just wanted to know whether there is a way to identify an optimal range of clusters based on some statistical tool or definition, below (or above) which I am under- (or over-) clustering.

If you insist, you could use bootstrapCluster() to determine whether your clusters are stable. High coassignment probabilities mean that you're overclustering to the point that the clusters are not robust to sampling noise. However, this is probably less useful than it sounds; if the clusters are so small that they can't even handle sampling noise, the resolution is probably far too high for meaningful interpretation.
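As a sketch of how that check fits with the clustering function from the original post (again assuming `sce` holds the corrected dimensions):

```r
library(scran)

# Same clustering function as in the original post.
myClusterFUN <- function(x) {
    g <- buildSNNGraph(x, use.dimred="corrected", type="jaccard")
    igraph::cluster_louvain(g)$membership
}

# 'sce' assumed as before.
originals <- myClusterFUN(sce)

# Coassignment probabilities across bootstrap replicates;
# high off-diagonal values indicate clusters that are not robust to resampling.
coassign <- bootstrapCluster(sce, FUN=myClusterFUN, clusters=originals)
```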

Here's another way of looking at it: imagine that you have a perfect assignment of cells to their true types. (For argument's sake, let's work with T cells here.) Now you need to write up the results. What kind of resolution is relevant to you? CD4-positive vs CD8-positive? Memory vs effector? Mature vs immature? Some-combination-of-CD-positives vs the-opposite-combination-of-CD-negatives? This is inherently a human decision - if you can make it, then you can just choose the clustering resolution to match, but if you can't make it, no algorithm can help you.

However, to define some distinct cell types, that we know exist based on experimental biology, in those big blobs (in tsne or umap) of cell types that are labelled as just excitatory or inhibitory neurons, what would have been a more algorithmic approach, rather than just saying i increased resolution until I got the cell types I wanted.

And... what's wrong with that? I think that's totally fine.

Let me predict what will happen with these algorithmic approaches. You'll try one of them - it may or may not give satisfactory results. You'll then think, "Oh, better check another approach." So you'll try another one and it will most likely give different results to the first, due to differences in its parameter settings or assumptions. Repeat the process with a whole bunch of methods and you'll end up with a whole bunch of different clusterings... from which you will choose one clustering that seems to be the most interesting/sensible/whatever you're looking for.

So if you're going to be "subjective" anyway, you might as well be upfront about it; adjust the resolution directly and save yourself a round-trip through all those algorithms. In fact, I don't think this is any more "subjective" than other day-to-day scientific decisions. For example, why do we summarize expression at the gene level? Why not work at the transcript level? Why not at the exon level? Hell, why not go down to kmer-level counts? The answer is because some of these are more interpretable than others, and that's the same thing with cluster resolution.

But someone could say, why didn't I just increase the resolution in the big cluster itself, rather than subclustering?

Sure you could; there are many ways to cut this cake. However, subclustering does offer the benefit of redefining the feature set (HVGs and PCs) to improve resolution within the big cluster, avoiding noise from the irrelevant genes that were only needed to separate the big cluster from the other clusters.
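As a sketch of that feature-redefinition step, scran's quickSubCluster() automates it, assuming `sce` is the original SingleCellExperiment and `clusters` holds the first-round assignments:

```r
library(scran)

# quickSubCluster() goes back to the counts, recomputes HVGs and PCs within
# each parent cluster, and clusters again within each one.
subout <- quickSubCluster(sce, groups=clusters)

# Each element is a SingleCellExperiment for one parent cluster,
# with the subcluster assignments stored in its colData:
table(subout[[1]]$subcluster)
```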

For that I wanted to argue that statistically we identified those clusters at the gross level and they are "optimal", hence to further identify cell types, we sub-clustered the major clusters, using a recursive method (probably just two recursive levels).

Hopefully you can see the contradiction here. If you're intending to subcluster, then you're conceding that the first round of clustering is not of sufficiently high resolution to capture the relevant heterogeneity. Which is perfectly acceptable to say and doesn't compromise the analysis at all, but you can't then turn around and claim that the first round was optimal.

@kevin
Republic of Ireland

The answer is that there is no answer, I think: the determination of the optimal number of clusters in a dataset is a difficult and subjective task, one whose outcome depends on quite a number of factors. At one end of the spectrum, in a single-cell setting, every single cell could be its own cluster, given that each will have its own [slightly unique] transcriptional programme; at the other end, the entire dataset could be regarded as a single cluster. In between these two extremes, there is no single metric that can identify the 'best' value for k, and it's difficult to even elaborate on what 'best' means in this situation.

First, regarding the Gap statistic: I began using this metric 'way back' in 2012 and had little luck with it. Interpreting the ideal k from its output is, again, a very subjective task, but one that I eventually managed in this publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5611786/, and elaborated on Biostars: https://www.biostars.org/p/281161/#281217. There, we regarded the ideal k as a consensus between the Gap statistic, the silhouette coefficient, and the Elbow method.

If you first want to identify an optimal number of PCs prior to clustering, Aaron has thankfully / kindly implemented this functionality in PCAtools: see the vignette section "Determine optimum number of PCs to retain".
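A brief sketch of those PCAtools metrics, assuming `mat` is a genes-by-cells matrix of log-normalized values:

```r
library(PCAtools)

# 'mat' assumed to be a genes-by-cells matrix of log-normalized values.
p <- pca(mat, removeVar=0.1)

# Elbow method on the percentage of variance explained per PC:
elbow <- findElbowPoint(p$variance)

# Horn's parallel analysis as an alternative estimate:
horn <- parallelPCA(mat)
horn$n
```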

The other thing I considered was to run k-means clustering to identify the optimal number of clusters in the integrated single-cell dataset, and then play with the resolution parameter until I get roughly that many graph-based clusters.

This goes back to my original point that there is no right or wrong answer. The resolution parameter controls the 'graininess' of the clusters, with no particular value being right or wrong. What you may be doing 'wrong' in this situation is assuming, without other independent lines of evidence, that the k identified via k-means is best for the data.

The approach that I took in a package in devel, but on CyTOF data, was to identify an optimum number of PCs via the PCAtools metrics mentioned above, perform UMAP on these, and then use Seurat's k-NN algorithm (with Louvain) to identify clusters. For CyTOF, though, the parameter configuration is quite a bit different due to the different data distribution.

[source: https://github.com/kevinblighe/scDataviz#find-ideal-clusters-in-the-umap-layout-via-k-nearest-neighbours]


For what it's worth, that cluster::clusGap() function can be terribly slow, so I enabled it for parallel processing:
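The code block for this appears to have been lost from the post. As a rough illustration of the idea only (not Kevin's actual implementation), the B reference draws of the gap statistic are independent of one another and can be farmed out with parallel::mclapply; a simplified k-means-based sketch:

```r
library(parallel)

# Simplified gap statistic with the B reference draws computed in parallel.
# 'x' is a numeric matrix (cells x dimensions). Use cores=1 on Windows,
# since mclapply forking is only available on Unix-alikes.
gapStatParallel <- function(x, k, B=50, cores=4) {
    logW <- function(m) {
        km <- kmeans(m, centers=k, nstart=10)
        log(sum(km$withinss))
    }
    obs <- logW(x)
    # Uniform reference over the range of each column, as in Tibshirani et al.
    rngs <- apply(x, 2, range)
    refs <- unlist(mclapply(seq_len(B), function(b) {
        ref <- apply(rngs, 2, function(r) runif(nrow(x), r[1], r[2]))
        logW(ref)
    }, mc.cores=cores))
    list(gap=mean(refs) - obs, SE.sim=sd(refs) * sqrt(1 + 1/B))
}
```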

Kevin


Thanks Kevin for your input. As I mentioned in my response to Aaron, I wanted to understand how to get optimal clusters non-subjectively, but I guess a subjective choice of resolution is accepted in the field. Thank you for the link and the paper; I will read them to understand the approach for choosing the optimal number of k-means clusters.

I also wanted to mention that in my current analysis I am integrating different datasets using fastMNN, which returns the corrected dimensions. I asked Aaron in another post on GitHub whether there is a way to know the optimal number of corrected dimensions to use in further steps, just as for PCA, and he mentioned that in newer updates he will include a variance value that can be used to identify that. However, I played with different numbers of corrected dimensions in the range of 30-50, keeping everything else unchanged, and didn't find any drastic difference in the UMAP representation or the number of clusters called. But the resolution parameter influenced the clustering a lot, and I just wanted to know whether its optimal value could also be determined.