Question: Scaling genes to have unit variance for clustering / heatmaps with DESeq2
gravatar for
2.9 years ago by
United States
tj.hu10 wrote:


I'm trying to cluster rows and columns in DESeq2 and show it in a heatmap. I've tried this:

heatmap.2(assay(vst)[select,], col = hmcol,
          Rowv = T, Colv = T, scale="row", trace="none", margin=c(8, 8))


I think my problem is revealed when the scale = 'row', the genes don't seem clustered, yet when scale='none' I see high medium and lowly expressed genes in separate clusters. I think this means that the rows are not normalized, while clustering columns, each column is normalized, right?

I think, when the rows aren't normalized the counts that are low tend to cluster to each other and high counts would cluster each other. For example:

geneW, Sample A = 10, B=20

geneX, Sample A = 15, B=15

geneY, Sample A = 1000,B=2000

GeneZ, Sample A=1500,B15000

if each gene is normalized, W and X are closer to Y and Z. But in realized W changes more similar with Y; and W remains constant like with Z

Is my interpretation correct? How would I normalize the rows?



ADD COMMENTlink modified 8 months ago by Michael Love18k • written 2.9 years ago by tj.hu10
gravatar for Michael Love
2.8 years ago by
Michael Love18k
United States
Michael Love18k wrote:

We do not recommend dividing out the standard deviation from each row/gene for plots which will involve clustering or ordination of samples and here's why:

We have already (approximately) stabilized the variance such that each gene has an equal shot at contributing to the distance sum regardless of its mean. Another way to say this is that we've removed systematic differences in gene variance over the dynamic range. Note that, at this point, subtracting the mean of each gene has no effect on distance calculation so we can ignore this step (often useful to subtract mean for visualization/heatmaps).

Now take the following data as an example, one gene with large DE across condition and two genes with no DE, just some biological and technical variation:

> cond <- factor(c("a","a","a","b","b","b"))
> mat <- matrix(c(1,1,1,100,100,100,rep(1:2,6)),ncol=6,byrow=TRUE)
> mat
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    1    1  100  100  100
[2,]    1    2    1    2    1    2
[3,]    1    2    1    2    1    2

Here's the distance matrix for the variance stabilized data:

> round(dist(t(mat)),1)
     1    2    3    4    5
2  1.4
3  0.0  1.4
4 99.0 99.0 99.0
5 99.0 99.0 99.0  1.4
6 99.0 99.0 99.0  0.0  1.4

Note: small distances within each condition, large distances across condition. This is what we'd want to show up in a heatmap, hierarchical clustering or PCA plot.

Now after scaling each row, the data looks like this:

> sc.mat <- t(apply(mat, 1, scale))

> round(sc.mat, 2)
      [,1]  [,2]  [,3] [,4]  [,5] [,6]
[1,] -0.91 -0.91 -0.91 0.91  0.91 0.91
[2,] -0.91  0.91 -0.91 0.91 -0.91 0.91
[3,] -0.91  0.91 -0.91 0.91 -0.91 0.91

Now the two genes with no signal are on the same scale as the gene which helped us pick apart the samples.

And if we calculate distances, we can see we've lost the signal which picks apart the condition:

> round(dist(t(sc.mat)), 1)
    1   2   3   4   5
2 2.6
3 0.0 2.6
4 3.2 1.8 3.2
5 1.8 3.2 1.8 2.6
6 3.2 1.8 3.2 0.0 2.6


ADD COMMENTlink modified 10 months ago • written 2.8 years ago by Michael Love18k
gravatar for b.nota
2.9 years ago by
b.nota290 wrote:

As far as I know the scale options has to do with the colors only. With scale="row" you divide all values by the mean of the row.

Don't you think you should use e.g., RPKM values for heatmaps anyways?

ADD COMMENTlink written 2.9 years ago by b.nota290

In the DESeq2 workflow, we recommend stabilizing the variance of observed counts first (the FPKM values don't tell you about their precision). Then, as you suggest, we also suggest to remove the mean, if the point is to cluster genes which show similar relative changes across samples (where the absolute count or absolute expression level is not as important as relative differences).

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by Michael Love18k
gravatar for Michael Love
2.9 years ago by
Michael Love18k
United States
Michael Love18k wrote:

I suggest to remove the row mean manually. See the "Gene clustering" example here:

ADD COMMENTlink written 2.9 years ago by Michael Love18k
gravatar for deut13
2.8 years ago by
deut1370 wrote:

Besides removing mean, in some cases it might be useful to use z-scores of the gene profiles in the data matrix:


If you use heatmap.2 you should input scaled data and use scale="none" as the source code shows that data are scaled after clustering for visualization.

ADD COMMENTlink written 2.8 years ago by deut1370

I have an argument for why not to use gene z-scores in a post below.

ADD REPLYlink written 2.8 years ago by Michael Love18k

It is a good argument. I was thinking of finding genes with similar 'shape', so that in a case like

mat <- matrix(c(1,1,1,10,10,10,10,10,10,100,100,100,rep(1:2,6)),ncol=6,byrow=TRUE) one would separate genes 1-2 from 3-4.

ADD REPLYlink written 2.8 years ago by deut1370

ah I see. Yes if you are only concerned in comparing genes, then this could be useful.

ADD REPLYlink written 2.8 years ago by Michael Love18k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 115 users visited in the last hour