Hello everyone, I am working on an scRNA-seq dataset with multiple clustering algorithms. I want to assess the quality of the clustering with the silhouette plot function from the "cluster" package. As far as I know, the function requires a distance matrix. My data is stored in a Seurat object (which can be converted to a SingleCellExperiment), and clustering with the Seurat method provides an SNN matrix. Which option would be better to use?
1) Using the SNN matrix as the distance matrix
2) Calculating a new distance matrix with the dist() function. (Actually, I tried Euclidean distance but it did not work; the plots look remarkably poor.) My data is really big, so I am sure it will take a long time. I could compute the distance matrix from the principal components (say, the first 50 PCs).
Thank you in advance.
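For reference, option 2 can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the PC matrix and the cluster labels are simulated stand-ins (with a real Seurat object, the PC coordinates would presumably come from something like Embeddings(obj, "pca")), and the subsample size of 100 is arbitrary. Subsampling the cells is one way to keep the distance matrix small when the dataset is large.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Simulated stand-in for a cells-by-PCs matrix (hypothetical data):
# two groups of 100 cells, 50 PCs, separated along the first PC.
pcs <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)
pcs[101:200, 1] <- pcs[101:200, 1] + 10
clusters <- rep(1:2, each = 100)  # stand-in for the real cluster labels

# Subsample cells so that dist() stays feasible on large datasets.
keep <- sample(nrow(pcs), 100)

# Euclidean distances in PC space: a true dissimilarity, unlike SNN weights.
d <- dist(pcs[keep, ])

# silhouette() takes integer cluster codes plus a dist object.
sil <- silhouette(clusters[keep], dist = d)
mean_width <- mean(sil[, "sil_width"])
mean_width
```

Calling plot(sil) on the result should draw the usual silhouette plot. An SNN matrix holds similarities (larger means closer), so passing it where a distance is expected would be expected to give misleading widths.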
Thank you Aaron, this function looks very useful. I believe the SNN matrix is a similarity matrix. I tried silhouette plots before but the results were always very bad (-1). Now I see the reason. I just have a small question: does buildSNNGraph use Euclidean distance or the Jaccard index as a measure?
Thank you again.
I think you have your ideas mixed up here. To clarify:
buildSNNGraph uses the Euclidean distance to identify pairs of cells with shared neighbours. It creates a link between these paired cells, weighted based on the maximum average rank of the set of shared neighbours. (That is, a pair of cells that share their closest neighbour will have a high-weight link, while a pair of cells that only share their furthest neighbour will have a zero-weight link.) This is as described by Xu and Su (2015); see the References mentioned in ?buildSNNGraph.
You can also set type="number", which will define weights for each link based on the number of shared nearest neighbours. This ignores the ranking of the shared neighbours entirely, and is closest to the "Jaccard clustering" done by Seurat. I don't use this much, mostly because I'm lazy and this setting is not the default.
There is no choice between Euclidean distances and Jaccard indices (not that the latter is ever used, anyway). If you like, you can switch to Manhattan distances for the NN search - at least in the BioC-devel branch - by setting BNPARAM=KmknnParam(distance="manhattan"). But I don't see a strong reason to do this.
Hello Aaron, thank you for your answer. You are right, I am a little confused. buildSNNGraph uses Euclidean distance, but I believe the SNN matrix is a similarity matrix (that is why silhouette plots with that matrix do not work, and why you added this warning to the tutorial: "We do not use the silhouette coefficient to assess clustering for large datasets. This is because cluster::silhouette requires the construction of a distance matrix, which may not be feasible when many cells are involved"). I will try the type parameter, check the quality of the clustering with the modularity() function, and consider the graph with the highest modularity as the best one. I also want to use the cluster_louvain() function from the igraph package with the igraph object (SNN matrix). I am sorry for the many questions and confused comments.
Thanks in advance
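The modularity check mentioned above could look roughly like the sketch below. It is a toy example, not the actual workflow: the graph is a hypothetical weighted similarity graph built by hand (two tight triangles joined by one weak edge), standing in for the igraph object that buildSNNGraph would return. The same two calls would presumably be repeated on graphs built with different type settings.

```r
library(igraph)

set.seed(1)
# Hypothetical weighted similarity graph: two triangles joined by a weak edge.
edges <- data.frame(
  from   = c(1, 1, 2, 4, 4, 5, 3),
  to     = c(2, 3, 3, 5, 6, 6, 4),
  weight = c(5, 5, 5, 5, 5, 5, 0.5)
)
g <- graph_from_data_frame(edges, directed = FALSE,
                           vertices = data.frame(name = 1:6))

# Louvain community detection, passing the edge weights explicitly.
comm <- cluster_louvain(g, weights = E(g)$weight)
memb <- as.integer(membership(comm))

# Weighted modularity of the resulting partition.
q <- modularity(g, memb, weights = E(g)$weight)
q
```

Both functions live in igraph, and both work on edge weights as similarities, so no conversion to distances is needed at this step.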