Question: Scaling genes to have unit variance for clustering / heatmaps with DESeq2
1
tj.hu10 wrote:

Hi,

I'm trying to cluster rows and columns in DESeq2 and show it in a heatmap. I've tried this:

heatmap.2(assay(vst)[select,], col = hmcol,
Rowv = T, Colv = T, scale="row", trace="none", margin=c(8, 8))

I think my problem is revealed when the scale = 'row', the genes don't seem clustered, yet when scale='none' I see high medium and lowly expressed genes in separate clusters. I think this means that the rows are not normalized, while clustering columns, each column is normalized, right?

I think, when the rows aren't normalized the counts that are low tend to cluster to each other and high counts would cluster each other. For example:

geneW, Sample A = 10, B=20

geneX, Sample A = 15, B=15

geneY, Sample A = 1000,B=2000

GeneZ, Sample A=1500,B15000

if each gene is normalized, W and X are closer to Y and Z. But in realized W changes more similar with Y; and W remains constant like with Z

Is my interpretation correct? How would I normalize the rows?

Thanks!

Answer: Biclustering Normalizing by Row in Heatmap of DESeq2
3
Michael Love25k wrote:

We do not recommend dividing out the standard deviation from each row/gene for plots which will involve clustering or ordination of samples and here's why:

We have already (approximately) stabilized the variance such that each gene has an equal shot at contributing to the distance sum regardless of its mean. Another way to say this is that we've removed systematic differences in gene variance over the dynamic range. Note that, at this point, subtracting the mean of each gene has no effect on distance calculation so we can ignore this step (often useful to subtract mean for visualization/heatmaps).

Now take the following data as an example, one gene with large DE across condition and two genes with no DE, just some biological and technical variation:

> cond <- factor(c("a","a","a","b","b","b"))
> mat <- matrix(c(1,1,1,100,100,100,rep(1:2,6)),ncol=6,byrow=TRUE)
> mat
[,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    1    1  100  100  100
[2,]    1    2    1    2    1    2
[3,]    1    2    1    2    1    2

Here's the distance matrix for the variance stabilized data:

> round(dist(t(mat)),1)
1    2    3    4    5
2  1.4
3  0.0  1.4
4 99.0 99.0 99.0
5 99.0 99.0 99.0  1.4
6 99.0 99.0 99.0  0.0  1.4

Note: small distances within each condition, large distances across condition. This is what we'd want to show up in a heatmap, hierarchical clustering or PCA plot.

Now after scaling each row, the data looks like this:

> sc.mat <- t(apply(mat, 1, scale))

> round(sc.mat, 2)
[,1]  [,2]  [,3] [,4]  [,5] [,6]
[1,] -0.91 -0.91 -0.91 0.91  0.91 0.91
[2,] -0.91  0.91 -0.91 0.91 -0.91 0.91
[3,] -0.91  0.91 -0.91 0.91 -0.91 0.91

Now the two genes with no signal are on the same scale as the gene which helped us pick apart the samples.

And if we calculate distances, we can see we've lost the signal which picks apart the condition:

> round(dist(t(sc.mat)), 1)
1   2   3   4   5
2 2.6
3 0.0 2.6
4 3.2 1.8 3.2
5 1.8 3.2 1.8 2.6
6 3.2 1.8 3.2 0.0 2.6

ADD COMMENTlink modified 2.1 years ago • written 4.1 years ago by Michael Love25k
Answer: Biclustering Normalizing by Row in Heatmap of DESeq2
0
b.nota340 wrote:

As far as I know the scale options has to do with the colors only. With scale="row" you divide all values by the mean of the row.

Don't you think you should use e.g., RPKM values for heatmaps anyways?

In the DESeq2 workflow, we recommend stabilizing the variance of observed counts first (the FPKM values don't tell you about their precision). Then, as you suggest, we also suggest to remove the mean, if the point is to cluster genes which show similar relative changes across samples (where the absolute count or absolute expression level is not as important as relative differences).

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by Michael Love25k
Answer: Biclustering Normalizing by Row in Heatmap of DESeq2
0
Michael Love25k wrote:

I suggest to remove the row mean manually. See the "Gene clustering" example here:

http://master.bioconductor.org/help/workflows/rnaseqGene/#plots

Answer: Biclustering Normalizing by Row in Heatmap of DESeq2
0
deut1370 wrote:

Besides removing mean, in some cases it might be useful to use z-scores of the gene profiles in the data matrix:

data.t<-t(data)
data.scaled<-t(scale(data.t))

If you use heatmap.2 you should input scaled data and use scale="none" as the source code shows that data are scaled after clustering for visualization.

I have an argument for why not to use gene z-scores in a post below.

It is a good argument. I was thinking of finding genes with similar 'shape', so that in a case like

mat <- matrix(c(1,1,1,10,10,10,10,10,10,100,100,100,rep(1:2,6)),ncol=6,byrow=TRUE) one would separate genes 1-2 from 3-4.