This time I am trying to figure out how I can determine whether I have obtained the optimal number of clusters using the buildSNNGraph and igraph approach.
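For reference, the workflow I mean is along these lines (a minimal sketch; `sce` is assumed to be a SingleCellExperiment with a "PCA" reducedDim already computed, and the parameter values are just illustrative):

```r
library(scran)
library(igraph)

# Build a shared nearest-neighbor graph from the PCA coordinates.
g <- buildSNNGraph(sce, k=10, use.dimred="PCA")

# Cluster the graph; Walktrap is one common choice of community detection.
clusters <- cluster_walktrap(g)$membership
table(clusters)
```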
My apartment has an amazing view that gives me a panorama across San Francisco. If I stare out the window, I can see the apartment blocks, the major arteries going in and out of the city, the mountain ranges in the distance; if I stare a bit longer, I start to notice the smaller buildings, the main roads, the parks; and if I continue watching, I can make out trees, vehicles, pedestrians and the occasional dog.
Which one of those views is correct? Well, what do you care about? I'm more of a landscape guy, so I like to sit back in the comfort of my living room and enjoy the vista in front of me. ("Everything the light touches...", etc.) But that might not be the case in other contexts. If I had to deliver something, I would need to care about the finer detail of the city's layout, and if I was waiting for someone, I would need to keep an eye on all the people passing by.
Now, I'm not (just) bragging about my apartment here, because this serves as a good metaphor for clustering resolution. Do you care about the major cell types? Do you want to cut them up into subtypes? Do you want to cut them further into different metabolic/stress/activation states? The data doesn't exist in a vacuum, so adjust the resolution until you get clusters that can help you answer your scientific question.
You might feel that's an unsatisfying answer. "If there's no right or wrong clustering, I could just try a whole bunch of different clusterings and choose the one that I like the best," you might say. Well, yes! You can do that! The vast majority of single-cell data analysis is exploratory and, in my opinion, should be aiming to generate hypotheses. It's fine to play fast and loose here to get a piece of useful information from the dataset, as long as you are willing to test the subsequent hypotheses with an independent experiment.
I recently read a methods paper, IKAP, which is built on the Seurat object and Seurat's clustering functions; it identifies the optimal number of PCs and the optimal resolution parameter for clustering using gap statistic values.
At the risk of sounding like a grumpy old man, I would say that the optimal number of clusters is the number of clusters that you and/or your collaborators can be bothered to annotate and/or validate.
Now, when I try to run it, I get an error about memory allocation, as I am trying to run it on my laptop. So I don't even know whether this approach will work.
Good lord, don't try to run it on a real dataset. You will see that my comments on k-means in the book are rather equivocal, because I don't think it's a good frontline clustering method that yields interpretable results. It is, however, a good data compression approach that can be used in conjunction with other approaches to speed them up. (Note that, when I'm talking about speed, I'm referring to kmeans, not clusGap; the latter is only necessary if you were aiming to use k-means as your frontline method.)
So I am wondering if this is a legitimate strategy, or whether I am committing the mistake of comparing apples (graph-based clusters) to oranges (k-means-based clusters).
Most graph-based clustering algorithms have a natural way of choosing the "optimal" number of clusters, via maximization of the modularity score. This is a metric that - well, read the book. Now, I quoted "optimal" above because this may not have much relevance to what is best for your situation. The modularity is an arbitrary measure that has several different definitions, while the nearest-neighbor graph is an artificial construct that depends on some choice of k and some definition of the weighting of shared neighbors.
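To make that concrete, one could compare the modularity achieved at several graph-building parameters (a hedged sketch; `sce` and the choices of k are assumptions, and `modularity()` here is igraph's modularity for a communities object):

```r
library(scran)
library(igraph)

# Modularity of the Walktrap clustering at several choices of k.
# Higher modularity is "better" by this metric - with the caveats above
# that both the metric and the graph itself are somewhat arbitrary.
for (k in c(5, 10, 20, 50)) {
    g <- buildSNNGraph(sce, k=k, use.dimred="PCA")
    cl <- cluster_walktrap(g)
    cat("k =", k, ":", length(unique(cl$membership)), "clusters,",
        "modularity =", modularity(cl), "\n")
}
```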
Thanks Aaron for your comments, and congrats on having a place with such a nice view in SF; hope you can see the Golden Gate Bridge too, with appropriate parameters.
I do get the point that I could increase or decrease the resolution based on my needs. However, I just wanted to know whether there is a way to identify an optimal range of cluster numbers, based on some statistical tool or definition, below (or above) which I am under- (or over-) clustering.
In the dataset I am analyzing, just like in others I have seen from large dataset integrations (for example, the mouse nervous system atlas), the non-neuronal and some specific neuronal populations are already clustered distinctly even at lower resolution. However, to define some distinct cell types that we know exist from experimental biology, within those big blobs (in t-SNE or UMAP) of cells that are labelled as just excitatory or inhibitory neurons, what would be a more algorithmic approach, rather than just saying I increased the resolution until I got the cell types I wanted?
When I plot some known genes, for example those involved in anterior-posterior patterning or those that follow a temporal order during cell state specification, I see that they show a spatial distribution within one cluster; based on my prior knowledge of the biology, I want that cluster to be separated. Since it wasn't, I felt I might be under-clustering. Second, I also want to subcluster the big clusters obtained in step one, to further dissect those big clusters of specific neuronal types into (probably) positional or temporal cell types. But someone could ask: why not just increase the resolution on the full dataset, rather than subclustering? To that, I wanted to argue that we statistically identified those clusters at the gross level and they are "optimal", and hence, to further identify cell types, we sub-clustered the major clusters using a recursive method (probably just two recursive levels). Hope my argument is not too confusing.
Only the Bay bridge, I'm afraid. Couldn't bear to fork out an extra 1K per month to get a similarly good view on the other side.
If you insist, you could use bootstrapCluster() to determine whether your clusters are stable. High coassignment probabilities mean that you're overclustering to the point that the clusters are not robust to sampling noise. However, this is probably less useful than it sounds; if the clusters are so small that they can't even handle sampling noise, the resolution is probably far too high for meaningful interpretation.

Here's another way of looking at it: imagine that you have a perfect assignment of cells to their true types. (For argument's sake, let's work with T cells here.) Now you need to write up the results. What kind of resolution is relevant to you? CD4-positive vs CD8-positive? Memory vs effector? Mature vs immature? Some-combination-of-CD-positives vs the-opposite-combination-of-CD-negatives? This is inherently a human decision - if you can make it, then you can just choose the clustering resolution to match, but if you can't make it, no algorithm can help you.
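Coming back to bootstrapCluster(): a sketch of how that check might look with scran (the FUN wrapper and parameters are illustrative assumptions; check ?bootstrapCluster for the exact interface in your version):

```r
library(scran)

# Cluster assignments as a function of a (bootstrapped) expression matrix.
clusterFUN <- function(x) {
    g <- buildSNNGraph(x, k=10)
    igraph::cluster_walktrap(g)$membership
}

# Coassignment probabilities across bootstrap replicates:
# high off-diagonal values indicate clusters that are not
# robust to sampling noise, i.e., likely overclustering.
coassign <- bootstrapCluster(sce, FUN=clusterFUN)
```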
And... what's wrong with that? I think that's totally fine.
Let me predict what will happen with these algorithmic approaches. You'll try one of them - it may or may not give satisfactory results. You'll then think, "Oh, better check another approach." So you'll try another one and it will most likely give different results to the first, due to differences in its parameter settings or assumptions. Repeat the process with a whole bunch of methods and you'll end up with a whole bunch of different clusterings... from which you will choose one clustering that seems to be the most interesting/sensible/whatever you're looking for.
So if you're going to be "subjective" anyway, you might as well be upfront about it; adjust the resolution directly and save yourself a round-trip through all those algorithms. In fact, I don't think this is any more "subjective" than other day-to-day scientific decisions. For example, why do we summarize expression at the gene level? Why not work at the transcript level? Why not at the exon level? Hell, why not go down to kmer-level counts? The answer is because some of these are more interpretable than others, and that's the same thing with cluster resolution.
Sure you could; there are many ways to cut this cake. However, subclustering does offer the benefit of redefining the set of features (HVGs and PCs) to improve resolution within the big cluster, avoiding noise from irrelevant genes that were only needed to separate the big cluster from the other clusters.
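As a sketch of what that redefinition looks like in practice (function names are from scran/scater; `sce`, `clusters`, and the parameter values are assumptions for illustration):

```r
library(scran)
library(scater)

# Subset to the big cluster of interest, then redo feature selection
# and dimensionality reduction using only those cells.
sub <- sce[, clusters == 1]
dec.sub <- modelGeneVar(sub)
hvgs.sub <- getTopHVGs(dec.sub, n=2000)
sub <- runPCA(sub, subset_row=hvgs.sub)

# Recluster on the subcluster-specific features, free of the genes
# that only distinguished the big cluster from everything else.
g.sub <- buildSNNGraph(sub, use.dimred="PCA")
sub.clusters <- igraph::cluster_walktrap(g.sub)$membership
```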
Hopefully you can see the contradiction here. If you're intending to subcluster, then you're conceding that the first round of clustering is not of sufficiently high resolution to capture the relevant heterogeneity. Which is perfectly acceptable to say and doesn't compromise the analysis at all, but you can't then turn around and claim that the first round was optimal.