Search
Question: plotting correlation between samples
0
18 months ago by
teresati10
teresati10 wrote:

Dear all,

I would like to plot the correlation between my samples. I have a matrix with genes in rows and samples in columns, and for each gene a logCPM normalized values. If I want to look at the correlation between 2 samples I get an Error:

x <- as.data.frame(t(log.cpm.norm[,1]))
y <- as.data.frame(t(log.cpm.norm[,2]))
correlation <- as.data.frame(cor(x, y,use="pairwise.complete.obs"))
plot(correlation)

Error in plot.new() : figure margins too large
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf

It is correct to transpose the columns?

Furthermore, are log cpms usually used to compute correlations between samples?

To solve the margin problem I have tried to adjust the margins with:

par('mar')

Thank you for any suggestion.

Regards

modified 18 months ago by Aaron Lun21k • written 18 months ago by teresati10
2

What exactly are you trying to do? You are producing a matrix of the pairwise correlations between two single observations, which will be NA for all of those pairs (you can't compute a correlation between just two observations). You could hypothetically compute the correlation between the two samples, but that would be a single number. Neither of these things is interesting, or illuminating.

Are you perhaps trying to show which samples are more similar to each other? If so, the main tool is either MDS or PCA, both of which are useful for showing that sort of thing.

You might want to spend some time reading some tutorials, which could help you get on track. Here is one, if you are using DESeq2. You could also use the DESeq2 vignette, or if you are using edgeR, there is a User's Guide for that package.

2
18 months ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

As James says, the correlation between expression profiles of two libraries is rarely interesting in and of itself. More useful is the relative size of the correlation between different pairs. For example, is the correlation between samples in the same group greater than the correlation between samples in different groups? This reasoning motivates the construction of a MDS plot, as demonstrated in the code below:

stuff <- matrix(rnorm(1000), ncol=10) # 100 genes, 10 samples
cor.mat <- cor(stuff)
dist.mat <- sqrt(0.5*(1-cor.mat)) # magic step
coords <- cmdscale(dist.mat, k=2)
plot(coords[,1], coords[,2])

The magic step converts the pairwise correlation into a valid distance metric (see https://arxiv.org/abs/1208.3145 for details). The use of correlations has a couple of advantages over just using the Euclidean distances between the log-expression profiles:

1. Correlations are insensitive to scaling, so you don't have to worry about normalising the data.
2. Spearman's rho is robust to outliers that might interfere with estimation of relative differences between samples - though on the flip side, it is less sensitive to differences between libraries that do not involve many genes.

This is why we sometimes use correlation-based (i.e., cosine) distances in single-cell RNA-seq, instead of running PCA on the log-expression profiles or computing Euclidean distances between pairs of profiles for use in "standard" MDS.

As for your other question, I would calculate correlations from log-expression values. The log-transformation avoids domination of the estimated correlation by a handful of genes with large counts. Normalisation doesn't matter so much as the computed correlation isn't affected by scaling (aside from small changes with respect to the prior count). In fact, you could even use the original counts for computing Spearman's rho, which will be unaffected by general monotonic transformations of the data.

1

Another way to plot the correlations, would be with a heatmap. To follow up on Aaron's code example:

require(gplots)

heatmap.2(cor.mat, trace="none")