Modularity values of clusters and Differential Expression in scRNA-Seq

hamza_karakurt ▴ 60

Hello everyone, I have several questions about scRNA-Seq data analysis. I am using the scater/scran packages for the analysis.

As far as I know, modularity (provided by the igraph package) is a quality measure for clustering. I ran the cluster_walktrap and cluster_louvain algorithms on SNN and KNN graphs and computed the modularity score of each clustering. All of them have modularity above 0.85, but I am wondering whether there is a threshold for modularity. I think higher modularity indicates better clustering quality, but in some cases (I tried subclustering of specific clusters), modularity is about 0.4-0.5. Can we say they are low quality clusters?

Second, experiments from different batches can be corrected with MNN, but we cannot (actually, we shouldn't) use the corrected values in DE testing since they do not correspond to gene expression values. My question: we filter and normalize data from different batches separately and independently (as in the tutorials); can we use their logcounts of data from different conditions (such as healthy and disease) for differential expression analysis before we use MNN?

Thank you in advance.

scater scran igraph scRNA-Seq rots

updated 7.0 years ago by Aaron Lun ★ 29k • written 7.0 years ago by hamza_karakurt ▴ 60

score 1 · Answer 1 · 2019-02-27

Can we say they are low quality clusters?

You could say that; I prefer the term "poorly separated". However, "poorly separated" does not mean "useless". For me, clustering is just a way of breaking up the data set into parts that can be comprehended. A useful clustering procedure will deliver parts (i.e., clusters) that relate to biological concepts like cell types or status or whatever, which gives us something concrete to think about when we're trying to understand the data. If you treat clustering as a tool in this manner, then even poorly separated clusters are useful if they break up a big blob of cells into easily digestible chunks.

You might think that poorly separated clusters are less likely to be "real". But clusters are inherently empirically defined, so there's not much meaning to discussing whether they are real or not. The more pertinent question is whether two poorly-separated clusters are derived from the same underlying biological aspect or whether they correspond to different aspects, e.g., different cell types. Answering this question requires some strong assumptions about how these aspects manifest in the data, like "cells are normally distributed around an average expression profile for each cell type" (e.g., k-means). I do not find this line of investigation to be particularly interesting, and would rather spend my efforts on achieving a useful clustering that yields some hypotheses for experimental validation.

You might also think that poorly separated clusters are less stable in the sense that, under slightly different circumstances, they will merge with neighbouring clusters. This is a valid concern, but more from a perspective of logistics - it's annoying to have to re-find a poorly separated cluster if it keeps on merging with its neighbours every time you change a parameter in the upstream steps of your analysis. However, the cluster hasn't "disappeared" - the cells that make it up are still there, it's just the way you're summarizing the data that has changed.

So, sure, I would be more inclined to work on well-separated clusters, but only because it's easier. There is usually some important biology that occurs in poorly separated clusters, so it would be silly to dismiss them out of hand.
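
For anyone unsure what the modularity score measures, here is a minimal sketch in plain Python. It does not call igraph; the function name `modularity` and the toy graphs are hypothetical. It assumes an undirected, unweighted graph without self-loops, whereas SNN graphs are usually weighted, so treat it as an illustration of the definition and not as a substitute for the igraph function. Modularity compares the fraction of edges that fall inside clusters with the fraction expected if edges were placed at random given the node degrees.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Modularity Q of a partition on an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once (no self-loops).
    membership: dict mapping every node to its cluster label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")

    within = defaultdict(int)        # number of edges inside each cluster
    degree_sum = defaultdict(int)    # summed node degree of each cluster
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            within[membership[u]] += 1

    # Q = sum over clusters of (observed within-cluster edge fraction
    #     minus the fraction expected from the degrees alone).
    return sum(within[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Toy graph (hypothetical): two triangles joined by a single bridging edge.
triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
bridged = triangles + [(2, 3)]

natural = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
scrambled = {0: "A", 3: "A", 1: "B", 4: "B", 2: "C", 5: "C"}
single = {node: "A" for node in range(6)}

q_natural = modularity(bridged, natural)      # 5/14, about 0.357
q_single = modularity(bridged, single)        # 0.0
q_scrambled = modularity(bridged, scrambled)  # negative
q_disjoint = modularity(triangles, natural)   # 0.5, with no bridge at all
```

Note the last value: two completely disconnected triangles only reach 0.5, because k equal-sized, perfectly separated clusters give 1 - 1/k. The score depends on the graph and the number of clusters as well as on how clean the separation is, so a value of 0.4-0.5 from subclustering is not directly comparable to 0.85 from the full data set.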

can we use their logcounts of data from different conditions (such as healthy and disease) for differential expression analysis before we use MNN?

Yes. See for example here for a pseudo-bulk analysis. The workflows here demonstrate how to do this for each cell type after clustering on the MNN-corrected values. (Note that only the clustering is done on the corrected values; the DE analysis is done on the counts!)
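
To make the pseudo-bulk idea concrete, here is a minimal sketch in plain Python. It is not the code from the linked workflows; the function name `pseudo_bulk`, the sample names, and the tiny count matrix are all hypothetical. The assumption is that the cluster labels come from clustering on the MNN-corrected values, while the counts being summed are the original, uncorrected counts.

```python
def pseudo_bulk(counts, samples, clusters):
    """Sum raw counts over all cells that share a (sample, cluster) pair.

    counts: list of per-cell count vectors (one integer per gene).
    samples, clusters: per-cell labels, same length as counts.
    Returns a dict mapping (sample, cluster) to the summed count vector.
    """
    if not (len(counts) == len(samples) == len(clusters)):
        raise ValueError("counts, samples and clusters must have one entry per cell")

    summed = {}
    for cell, sample, cluster in zip(counts, samples, clusters):
        key = (sample, cluster)
        if key not in summed:
            summed[key] = [0] * len(cell)
        summed[key] = [total + value for total, value in zip(summed[key], cell)]
    return summed


# Hypothetical data: 4 cells x 3 genes of raw (uncorrected) counts.
counts = [
    [1, 0, 3],
    [2, 1, 0],
    [0, 5, 1],
    [4, 0, 0],
]
samples = ["healthy_1", "healthy_1", "disease_1", "disease_1"]
clusters = ["A", "A", "A", "B"]  # assumed to come from clustering on corrected values

bulk = pseudo_bulk(counts, samples, clusters)
# bulk[("healthy_1", "A")] is [3, 1, 3]: the two healthy cells in cluster A, summed.
```

Each summed vector then plays the role of one bulk sample for that cluster, and the healthy versus disease comparison is run per cluster with a count-based DE method. The corrected values are only used to assign the cluster labels.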