8 months ago by
Cambridge, United Kingdom
As James says, the correlation between expression profiles of two libraries is rarely interesting in and of itself. More useful is the relative size of the correlation between different pairs. For example, is the correlation between samples in the same group greater than the correlation between samples in different groups? This reasoning motivates the construction of an MDS plot, as demonstrated in the code below:
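stuff <- matrix(rnorm(1000), ncol=10) # simulated data: 100 genes (rows), 10 samples (columns)
cor.mat <- cor(stuff) # sample-by-sample correlations (Pearson by default; use method="spearman" for Spearman's rho)
dist.mat <- sqrt(0.5*(1-cor.mat)) # magic step
coords <- cmdscale(dist.mat, k=2) # classical MDS coordinates in two dimensions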
The magic step converts the pairwise correlation into a valid distance metric (see https://arxiv.org/abs/1208.3145 for details). The use of correlations has a couple of advantages over just using the Euclidean distances between the log-expression profiles:
- Correlations are insensitive to scaling, so you don't have to worry about normalising the data.
- Spearman's rho is robust to outliers that might interfere with estimation of relative differences between samples - though on the flip side, it is less sensitive to differences between libraries that do not involve many genes.
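Both the scaling point and the "valid distance" claim can be checked numerically. The following is only a minimal sketch on simulated data; the names `x`, `scaled`, `z`, `d.cor` and `d.euc` are made up for illustration. It shows that multiplying each sample by its own positive factor leaves the Pearson correlation unchanged, and that the magic step gives half the Euclidean distance between profiles that have been centred and scaled to unit length.

```r
set.seed(1)
x <- matrix(rnorm(1000), ncol=10) # 100 genes, 10 samples

# Multiply each sample (column) by its own positive factor.
scaled <- sweep(x, 2, runif(10, 0.5, 2), "*")
all.equal(cor(x), cor(scaled)) # Pearson correlations are unchanged

# Centre each profile and scale it to unit length.
z <- scale(x) / sqrt(nrow(x) - 1)

d.cor <- sqrt(0.5 * (1 - cor(x)))   # the magic step
d.euc <- as.matrix(dist(t(z))) / 2  # half the Euclidean distance between samples
all.equal(d.cor, d.euc, check.attributes=FALSE)
```

Because the result is an ordinary Euclidean distance in disguise, it satisfies the triangle inequality, which is what makes it suitable input for `cmdscale`.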
This is why we sometimes use correlation-based distances (the Pearson correlation being the cosine similarity of mean-centred profiles) in single-cell RNA-seq, instead of running PCA on the log-expression profiles or computing Euclidean distances between pairs of profiles for use in "standard" MDS.
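To tie this back to the within-group versus between-group comparison at the start, here is a rough sketch using Spearman's rho. Everything in it is simulated: the `group` labels, the shift of 2 applied to 20 genes in group B, and the plotting colours are assumptions chosen purely for illustration.

```r
set.seed(3)
group <- rep(c("A", "B"), each=5)
x <- matrix(rnorm(1000), ncol=10) # 100 genes, 10 samples
x[1:20, group == "B"] <- x[1:20, group == "B"] + 2 # shift 20 genes in group B

rho <- cor(x, method="spearman")
same <- outer(group, group, "==") # TRUE where both samples share a group
pairs <- upper.tri(rho)           # count each pair once, skip self-pairs
within <- mean(rho[pairs & same])   # average within-group correlation
between <- mean(rho[pairs & !same]) # average between-group correlation

# Same magic step as before, then classical MDS coloured by group.
coords <- cmdscale(sqrt(0.5 * (1 - rho)), k=2)
plot(coords, col=ifelse(group == "A", "blue", "red"), pch=16,
     xlab="MDS 1", ylab="MDS 2")
```

If `within` is clearly larger than `between`, samples from the same group should sit closer together on the plot than samples from different groups.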
As for your other question, I would calculate correlations from log-expression values. The log-transformation avoids domination of the estimated correlation by a handful of genes with large counts. Normalisation matters less here, because the computed correlation isn't affected by scaling (aside from small changes with respect to the prior count). In fact, you could even use the original counts for computing Spearman's rho, which is unaffected by any strictly increasing transformation of the data.
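As a rough illustration of that last point, the sketch below computes correlations from simulated counts. The negative binomial parameters, the prior count of 1 and the hand-rolled log-CPM formula are all assumptions made for the example; in practice you would probably get log-expression values from something like `cpm(y, log=TRUE)` in edgeR instead.

```r
set.seed(2)
# Simulated count matrix: 200 genes, 6 samples.
counts <- matrix(rnbinom(1200, mu=50, size=5), ncol=6)
lib.size <- colSums(counts)

# Simple log-CPM with a prior count of 1 (illustrative only).
logcpm <- log2(t(t(counts + 1) / lib.size) * 1e6)

pearson.log <- cor(logcpm) # Pearson on log-expression values

# Spearman's rho only depends on ranks within each sample, so the raw counts,
# log-counts and log-CPMs all give the same answer.
spearman.raw <- cor(counts, method="spearman")
spearman.log <- cor(log2(counts + 1), method="spearman")
spearman.cpm <- cor(logcpm, method="spearman")
all.equal(spearman.raw, spearman.log)
all.equal(spearman.raw, spearman.cpm)
```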
Aaron Lun • 17k