Question: Modularity values of clusters and Differential Expression in scRNA-Seq
12 weeks ago by
hamza_karakurt30 wrote:

Hello everyone, I have several questions about scRNA-Seq data analysis. I am using the scater/scran packages for analysis.

As far as I know, modularity (as computed by the igraph package) is a quality measure for a clustering. I applied the cluster_walktrap and cluster_louvain algorithms to SNN and KNN graphs and computed the modularity score of each clustering. All of them have a modularity above 0.85, but I am wondering whether there is a threshold for modularity. I think higher modularity indicates better clustering quality, but in some cases (when I tried subclustering of specific clusters) the modularity is about 0.4-0.5. Can we say they are low quality clusters?

Second, experiments from different batches can be corrected with MNN, but we cannot (actually, we shouldn't) use the corrected values in DE testing since they do not correspond to gene expression values. My question is: if we filter and normalize data from different batches separately and independently (as in the tutorials), can we use their logcounts of data from different conditions (such as healthy and disease) for differential expression analysis before we use MNN?

igraph scran scater scrna-seq rots • 128 views
modified 12 weeks ago by Aaron Lun23k • written 12 weeks ago by hamza_karakurt30
Answer: Modularity values of clusters and Differential Expression in scRNA-Seq
12 weeks ago by
Aaron Lun23k
Cambridge, United Kingdom

Can we say they are low quality clusters?

You could say that; I prefer the term "poorly separated". However, "poorly separated" does not mean "useless". For me, clustering is just a way of breaking up the data set into parts that can be comprehended. A useful clustering procedure will deliver parts (i.e., clusters) that relate to biological concepts like cell types or status or whatever, which gives us something concrete to think about when we're trying to understand the data. If you treat clustering as a tool in this manner, then even poorly separated clusters are useful if they break up a big blob of cells into easily digestible chunks.

You might think that poorly separated clusters are less likely to be "real". But clusters are inherently empirically defined, so there's not much meaning to discussing whether they are real or not. The more pertinent question is whether two poorly-separated clusters are derived from the same underlying biological aspect or whether they correspond to different aspects, e.g., different cell types. Answering this question requires some strong assumptions about how these aspects manifest in the data, like "cells are normally distributed around an average expression profile for each cell type" (e.g., k-means). I do not find this line of investigation to be particularly interesting, and would rather spend my efforts on achieving a useful clustering that yields some hypotheses for experimental validation.

You might also think that poorly separated clusters are less stable in the sense that, under slightly different circumstances, they will merge with neighbouring clusters. This is a valid concern, but more from a perspective of logistics - it's annoying to have to re-find a poorly separated cluster if it keeps on merging with its neighbours every time you change a parameter in the upstream steps of your analysis. However, the cluster hasn't "disappeared" - the cells that make it up are still there, it's just the way you're summarizing the data that has changed.

So, sure, I would be more inclined to work on well-separated clusters, but only because it's easier. There is usually some important biology that occurs in poorly separated clusters, so it would be silly to dismiss them out of hand.
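To make "poorly separated" concrete, here is a minimal sketch of how a modularity score can be computed for an unweighted, undirected graph. It is an illustration only: the function name and toy graph are made up for this example, and the SNN graphs discussed in this thread are typically weighted, which this sketch ignores.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Newman-Girvan modularity of an undirected, unweighted graph.

    edges: list of (u, v) pairs; membership: dict node -> cluster label.
    """
    m = len(edges)
    internal = defaultdict(int)  # edges with both ends inside the cluster
    degree = defaultdict(int)    # summed node degrees per cluster
    for u, v in edges:
        degree[membership[u]] += 1
        degree[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    # Observed fraction of within-cluster edges minus the fraction
    # expected if edges were placed at random with the same degrees.
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy graph (hypothetical): two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
well_separated = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
poorly_separated = {0: "a", 1: "b", 2: "a", 3: "b", 4: "a", 5: "b"}

q_good = modularity(edges, well_separated)
q_bad = modularity(edges, poorly_separated)
print(q_good, q_bad)
```

The first labelling keeps each triangle together and scores higher than the second, which splits both triangles. In both cases the cells (nodes) are the same; only the way they are summarized differs.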

can we use their logcounts of data from different conditions (such as healthy and disease) for differential expression analysis before we use MNN?

Yes. See for example here for a pseudo-bulk analysis. The workflows here demonstrate how to do this for each cell type after clustering on the MNN-corrected values. (Note that only the clustering is done on the corrected values, the DE analysis is done on the counts!)
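As a rough illustration of what "pseudo-bulk" means here, the sketch below sums raw counts over all cells that share a sample and a cluster label. The function name, data layout and toy values are assumptions made for this example, not the workflow's actual code; the resulting per-sample profiles would then go into a bulk DE method.

```python
from collections import defaultdict

def pseudo_bulk(counts, sample_of, cluster_of):
    """Sum raw counts over all cells sharing a (sample, cluster) label.

    counts: dict cell -> dict gene -> raw count (uncorrected values).
    sample_of, cluster_of: dict cell -> label.
    """
    summed = defaultdict(lambda: defaultdict(int))
    for cell, gene_counts in counts.items():
        key = (sample_of[cell], cluster_of[cell])
        for gene, n in gene_counts.items():
            summed[key][gene] += n
    return {key: dict(genes) for key, genes in summed.items()}

# Toy data (hypothetical): three cells, two samples, one cluster.
counts = {
    "cell1": {"geneA": 2, "geneB": 0},
    "cell2": {"geneA": 3, "geneB": 1},
    "cell3": {"geneA": 1, "geneB": 4},
}
sample_of = {"cell1": "s1", "cell2": "s1", "cell3": "s2"}
# Cluster labels would come from clustering on the MNN-corrected values.
cluster_of = {"cell1": "c1", "cell2": "c1", "cell3": "c1"}

bulk = pseudo_bulk(counts, sample_of, cluster_of)
print(bulk)
```

Note that the cluster labels are the only thing taken from the corrected data; the values being summed are the original counts.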

Thank you Aaron. So basically, poorly separated does not mean useless, and such clusters can still hold valuable information; and among 5-6 clusters, choosing the one with the highest modularity means using the "best-separated" one. And thank you for the information about DE analysis. Since the MNN correction does not affect the logcounts, we can do DE analysis on data from different batches without using batch correction. But I am wondering: the counts must have a batch effect, so do we just ignore it?

As a last question, about scran: the findMarkers function generates possible markers for each cluster, but in the results, what is the actual meaning of column "Top"?

Since the MNN correction does not affect the logcounts, we can do DE analysis on data from different batches without using batch correction. But I am wondering: the counts must have a batch effect, so do we just ignore it?

I'm not entirely sure what you mean by this, so I'll just say these things:

• Do not perform your DE analyses on the corrected values. Instead, use the uncorrected values and block on the batch in your design matrix.
• Both the log-transformed counts and the raw counts contain batch effects. I have used the log-counts in the DE analysis only for speed and convenience; you could alternatively use edgeR on the counts, provided you block on the batch effect.
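The effect of blocking on batch can be sketched with a toy example. The code below is not a fitted linear model; it only illustrates the underlying idea that conditions are compared within each batch before the results are combined. The function names and toy values are assumptions made for this illustration.

```python
from statistics import mean

def pooled_difference(cells):
    """Disease minus healthy mean log-expression, ignoring batch."""
    disease = [c["logexpr"] for c in cells if c["condition"] == "disease"]
    healthy = [c["logexpr"] for c in cells if c["condition"] == "healthy"]
    return mean(disease) - mean(healthy)

def blocked_difference(cells):
    """Average of the within-batch differences (a crude form of blocking)."""
    batches = sorted({c["batch"] for c in cells})
    per_batch = [
        pooled_difference([c for c in cells if c["batch"] == b])
        for b in batches
    ]
    return mean(per_batch)

# Toy data (hypothetical): no true condition effect, batch B shifted by +2,
# and the two conditions are unevenly spread across the batches.
cells = (
    [{"batch": "A", "condition": "healthy", "logexpr": 1.0}] * 3
    + [{"batch": "A", "condition": "disease", "logexpr": 1.0}]
    + [{"batch": "B", "condition": "healthy", "logexpr": 3.0}]
    + [{"batch": "B", "condition": "disease", "logexpr": 3.0}] * 3
)

naive = pooled_difference(cells)
blocked = blocked_difference(cells)
print(naive, blocked)
```

Ignoring batch reports a spurious difference between conditions, while the within-batch comparison reports none. This only works because both conditions are present in every batch.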

what is the actual meaning of column "Top"?

See the section "Consolidating p-values into a ranking" in ?combineMarkers (which is called by findMarkers). Basically, if you take the set of genes with Top=1, this is the same as taking the top DE gene from each pairwise comparison between your cluster of interest and every other cluster. The idea is to use a set of genes to define the cluster, rather than relying on a single marker gene that is DE between the current cluster and all other clusters. The latter scenario may not even exist, see my CD4/CD8 comments.
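The ranking idea can be sketched as follows: rank the genes within each pairwise comparison by p-value, then take each gene's minimum rank across comparisons. This is a simplified illustration with made-up gene names and p-values; it ignores ties and any other details handled by the real function.

```python
def top_ranks(pairwise_pvalues):
    """Minimum rank of each gene across all pairwise comparisons.

    pairwise_pvalues: dict other_cluster -> dict gene -> p-value,
    for comparisons of one cluster of interest against each other cluster.
    """
    top = {}
    for pvals in pairwise_pvalues.values():
        ordered = sorted(pvals, key=pvals.get)  # smallest p-value first
        for rank, gene in enumerate(ordered, start=1):
            top[gene] = min(rank, top.get(gene, rank))
    return top

# Toy p-values (hypothetical) for cluster 1 against clusters 2 and 3.
pairwise = {
    "vs_cluster2": {"geneA": 0.001, "geneB": 0.5, "geneC": 0.2},
    "vs_cluster3": {"geneA": 0.6, "geneB": 0.0001, "geneC": 0.3},
}

top = top_ranks(pairwise)
top1_set = sorted(g for g, r in top.items() if r <= 1)
print(top, top1_set)
```

Here the Top=1 set contains geneA and geneB, each being the best gene in one comparison. Neither gene separates cluster 1 from both other clusters on its own, which is why a set of genes is used rather than a single marker.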

Thank you Aaron,

"Since the MNN correction does not affect the logcounts, we can do DE analysis on data from different batches without using batch correction. But I am wondering: the counts must have a batch effect, so do we just ignore it?"

Actually, I am using external methods such as ROTS for differential expression analysis, and I use the logcounts. My question is: this method does not have any parameter for batch, so is using this kind of method wrong or acceptable?


If you must use a method that cannot consider batch terms, you should perform a meta-analysis instead. That is, perform differential expression between cell types or clusters within each batch (where, by definition, there is no batch effect), and combine the p-values across batches. However, if your comparisons are confounded with batch, then you're in trouble. If this is the case, performing the DE analysis on the corrected values will not help as your experimental design is fundamentally flawed.
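One way to combine per-batch p-values for a gene is Fisher's method, sketched below. The choice of Fisher's method is an assumption made for this illustration, as the answer above does not name a specific combining method. It also assumes the batches are independent and that every p-value is greater than zero.

```python
import math

def fisher_combine(pvalues):
    """Combine independent p-values for one gene with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-squared distribution
    with 2k degrees of freedom under the null, where k = len(pvalues).
    For even degrees of freedom the upper tail has a closed form:
        P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)

# Toy p-values (hypothetical) for one gene, one test per batch.
per_batch_pvalues = [0.05, 0.05]
combined = fisher_combine(per_batch_pvalues)
print(combined)
```

Fisher's method ignores the direction of the change, so a gene that goes up in one batch and down in another can still receive a small combined p-value. It is worth checking that the sign of the log-fold change agrees across batches.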