18 months ago by

Cambridge, United Kingdom

As James says, the correlation between the expression profiles of two libraries is rarely interesting in and of itself. More useful is the relative size of the correlation between different pairs. For example, is the correlation between samples in the same group greater than the correlation between samples in different groups? This reasoning motivates the construction of an MDS plot, as demonstrated in the code below:

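```r
set.seed(100) # for reproducibility
stuff <- matrix(rnorm(1000), ncol=10) # 100 genes, 10 samples
cor.mat <- cor(stuff) # pairwise Pearson correlations between samples (columns)
dist.mat <- sqrt(0.5*(1-cor.mat)) # magic step
coords <- cmdscale(dist.mat, k=2) # classical MDS in two dimensions
plot(coords[,1], coords[,2], xlab="MDS dimension 1", ylab="MDS dimension 2")
```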
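The within-group versus between-group comparison can also be checked numerically. The following is a minimal sketch on simulated data; the group labels, the number of genes and the size of the group effect are all made up for illustration.

```r
set.seed(1)
n.genes <- 200
group <- rep(c("A", "B"), each = 5)              # hypothetical group labels
expr <- matrix(rnorm(n.genes * 10), ncol = 10)   # 200 genes, 10 samples

# Simulated group effect: the first 50 genes go up in A and down in B.
effect <- c(rep(2, 50), rep(0, n.genes - 50))
expr[, group == "A"] <- expr[, group == "A"] + effect
expr[, group == "B"] <- expr[, group == "B"] - effect

cor.mat <- cor(expr)
same.group <- outer(group, group, "==")          # TRUE for pairs in the same group
upper <- upper.tri(cor.mat)                      # count each pair once, skip the diagonal

within.cor <- mean(cor.mat[upper & same.group])
between.cor <- mean(cor.mat[upper & !same.group])
c(within = within.cor, between = between.cor)
```

If the groups differ, the mean within-group correlation should be larger than the mean between-group correlation, which is the same structure an MDS plot shows graphically.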

The magic step converts the pairwise correlation into a valid distance metric (see https://arxiv.org/abs/1208.3145 for details). The use of correlations has a couple of advantages over just using the Euclidean distances between the log-expression profiles:

- Correlations are insensitive to scaling, so you don't have to worry about normalising the data.
- Spearman's rho (method="spearman" in cor(); the code above uses the default, Pearson) is robust to outliers that might interfere with estimation of relative differences between samples, though on the flip side, it is less sensitive to differences between libraries that do not involve many genes.
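Both points can be illustrated with a short sketch on simulated values. The numbers here (the scaling factor, the size of the outlier) are arbitrary choices for demonstration only.

```r
set.seed(2)
x <- rnorm(100)
y <- x + rnorm(100)

# Rescaling or shifting one profile leaves Pearson's correlation unchanged.
r.raw <- cor(x, y)
r.scaled <- cor(x, 10 * y + 3)
all.equal(r.raw, r.scaled)

# A single extreme value moves Pearson's r far more than Spearman's rho.
y.out <- y
y.out[1] <- 1000
pearson.shift <- abs(cor(x, y.out) - cor(x, y))
spearman.shift <- abs(cor(x, y.out, method = "spearman") -
                      cor(x, y, method = "spearman"))
c(pearson = pearson.shift, spearman = spearman.shift)
```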

This is why we sometimes use correlation-based distances in single-cell RNA-seq (for Pearson's correlation, these are cosine distances between the mean-centred profiles), instead of running PCA on the log-expression profiles or computing Euclidean distances between pairs of profiles for use in "standard" MDS.
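The link between correlation and cosine distances can be checked directly: Pearson's correlation is the cosine similarity of the two profiles after each has been mean-centred. A small sketch, where cosine() is a helper defined here for illustration and not a function from any package:

```r
set.seed(3)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

# Hypothetical helper: cosine of the angle between two vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Centre each profile, then take the cosine similarity.
cos.centred <- cosine(x - mean(x), y - mean(y))
all.equal(cos.centred, cor(x, y))
```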

As for your other question, I would calculate correlations from log-expression values. The log-transformation avoids domination of the estimated correlation by a handful of genes with large counts. Normalisation doesn't matter so much as the computed correlation isn't affected by scaling (aside from small changes with respect to the prior count). In fact, you could even use the original counts for computing Spearman's rho, which will be unaffected by general monotonic transformations of the data.
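As a rough illustration of the last point, the sketch below simulates counts (negative binomial, with made-up means and dispersion) and compares correlations before and after a log-transformation. The prior count of 1 is an arbitrary choice for the example.

```r
set.seed(4)
n.genes <- 500
mu <- 2^runif(n.genes, 0, 12)   # simulated gene-specific means
# mu is recycled, so both samples (columns) share the same means.
counts <- matrix(rnbinom(n.genes * 2, mu = mu, size = 10), ncol = 2)
logged <- log2(counts + 1)      # prior count of 1, for illustration

# Spearman's rho depends only on ranks, so a monotonic transformation
# of the values leaves it unchanged.
rho.counts <- cor(counts[, 1], counts[, 2], method = "spearman")
rho.logged <- cor(logged[, 1], logged[, 2], method = "spearman")
all.equal(rho.counts, rho.logged)

# Pearson's r differs between scales, as large counts dominate the raw values.
c(raw = cor(counts[, 1], counts[, 2]), logged = cor(logged[, 1], logged[, 2]))
```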

written 18 months ago by Aaron Lun (modified 18 months ago)

What exactly are you trying to do? You are producing a matrix of the pairwise correlations between two single observations, which will be NA for all of those pairs (you can't compute a correlation between just two observations). You could hypothetically compute the correlation between the two samples, but that would be a single number. Neither of these things is interesting or illuminating.
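The code from the original question is not shown here, so the following is only a guess at the situation, using simulated values. It shows both cases: a correlation between two single values, and a correlation between two whole samples.

```r
set.seed(5)
expr <- matrix(rnorm(200), ncol = 2)   # 100 genes, 2 samples (simulated)

# One gene at a time: each "profile" is a single value, so the result is NA.
single <- cor(expr[1, 1], expr[1, 2])
is.na(single)

# All genes at once: one correlation between the two samples.
whole <- cor(expr[, 1], expr[, 2])
length(whole)
```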

Are you perhaps trying to show which samples are more similar to each other? If so, the main tool is either MDS or PCA, both of which are useful for showing that sort of thing.
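For example, a PCA plot can be sketched with base R's prcomp(). The data below are simulated, and the group labels and effect size are invented for illustration; with real data you would start from a matrix of log-expression values.

```r
set.seed(6)
group <- rep(c("A", "B"), each = 5)              # hypothetical group labels
logexpr <- matrix(rnorm(100 * 10), ncol = 10)    # 100 genes, 10 samples
# Simulated group effect: the first 50 genes are higher in group B.
logexpr[1:50, group == "B"] <- logexpr[1:50, group == "B"] + 3

# prcomp() expects samples in rows, so transpose the gene-by-sample matrix.
pca <- prcomp(t(logexpr))
plot(pca$x[, 1], pca$x[, 2],
     col = ifelse(group == "A", "blue", "red"),
     xlab = "PC1", ylab = "PC2")
```

Samples that are more similar to each other sit closer together on the plot, so the two simulated groups should separate along the first component.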

You might want to spend some time reading some tutorials, which could help you get on track. Here is one, if you are using DESeq2. You could also use the DESeq2 vignette, or if you are using edgeR, there is a User's Guide for that package.
