Question

Scaling genes to have unit variance for clustering / heatmaps with DESeq2

1

Entering edit mode

tj.hu ▴ 10

@tjhu-8369

Last seen 8.6 years ago

United States

Hi,

I'm trying to cluster rows and columns in DESeq2 and show it in a heatmap. I've tried this:

heatmap.2(assay(vst)[select,], col = hmcol,
Rowv = T, Colv = T, scale="row", trace="none", margin=c(8, 8))

I think my problem is revealed when the scale = 'row', the genes don't seem clustered, yet when scale='none' I see high medium and lowly expressed genes in separate clusters. I think this means that the rows are not normalized, while clustering columns, each column is normalized, right?

I think, when the rows aren't normalized the counts that are low tend to cluster to each other and high counts would cluster each other. For example:

geneW, Sample A = 10, B=20

geneX, Sample A = 15, B=15

geneY, Sample A = 1000,B=2000

GeneZ, Sample A=1500,B15000

if each gene is normalized, W and X are closer to Y and Z. But in realized W changes more similar with Y; and W remains constant like with Z

Is my interpretation correct? How would I normalize the rows?

Thanks!

deseq2 heatmap.2 bicluster pheatmap • 7.9k views

ADD COMMENT • link updated 6.5 years ago by Michael Love 41k • written 8.6 years ago by tj.hu ▴ 10

score 3 · Answer 1 · 2015-09-10

We do not recommend dividing out the standard deviation from each row/gene for plots which will involve clustering or ordination of samples and here's why:

We have already (approximately) stabilized the variance such that each gene has an equal shot at contributing to the distance sum regardless of its mean. Another way to say this is that we've removed systematic differences in gene variance over the dynamic range. Note that, at this point, subtracting the mean of each gene has no effect on distance calculation so we can ignore this step (often useful to subtract mean for visualization/heatmaps).

Now take the following data as an example, one gene with large DE across condition and two genes with no DE, just some biological and technical variation:

> cond <- factor(c("a","a","a","b","b","b"))

> mat <- matrix(c(1,1,1,100,100,100,rep(1:2,6)),ncol=6,byrow=TRUE)
> mat
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    1    1  100  100  100
[2,]    1    2    1    2    1    2
[3,]    1    2    1    2    1    2

Here's the distance matrix for the variance stabilized data:

> round(dist(t(mat)),1)
     1    2    3    4    5
2  1.4
3  0.0  1.4
4 99.0 99.0 99.0
5 99.0 99.0 99.0  1.4
6 99.0 99.0 99.0  0.0  1.4

Note: small distances within each condition, large distances across condition. This is what we'd want to show up in a heatmap, hierarchical clustering or PCA plot.

Now after scaling each row, the data looks like this:

> sc.mat <- t(apply(mat, 1, scale))

> round(sc.mat, 2)
      [,1]  [,2]  [,3] [,4]  [,5] [,6]
[1,] -0.91 -0.91 -0.91 0.91  0.91 0.91
[2,] -0.91  0.91 -0.91 0.91 -0.91 0.91
[3,] -0.91  0.91 -0.91 0.91 -0.91 0.91

Now the two genes with no signal are on the same scale as the gene which helped us pick apart the samples.

And if we calculate distances, we can see we've lost the signal which picks apart the condition:

> round(dist(t(sc.mat)), 1)
    1   2   3   4   5
2 2.6
3 0.0 2.6
4 3.2 1.8 3.2
5 1.8 3.2 1.8 2.6
6 3.2 1.8 3.2 0.0 2.6

score 0 · Answer 2 · 2015-09-08

0

Entering edit mode

b.nota ▴ 360

@bnota-7379

Last seen 3.6 years ago

Netherlands

As far as I know the scale options has to do with the colors only. With scale="row" you divide all values by the mean of the row.

Don't you think you should use e.g., RPKM values for heatmaps anyways?

ADD COMMENT • link 8.6 years ago b.nota ▴ 360

0

Entering edit mode

In the DESeq2 workflow, we recommend stabilizing the variance of observed counts first (the FPKM values don't tell you about their precision). Then, as you suggest, we also suggest to remove the mean, if the point is to cluster genes which show similar relative changes across samples (where the absolute count or absolute expression level is not as important as relative differences).

ADD REPLY • link 8.6 years ago Michael Love 41k

score 0 · Answer 3 · 2015-09-08

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 hour ago

United States

I suggest to remove the row mean manually. See the "Gene clustering" example here:

http://master.bioconductor.org/help/workflows/rnaseqGene/#plots

ADD COMMENT • link 8.6 years ago Michael Love 41k

score 0 · Answer 4 · 2015-09-10

0

Entering edit mode

deut13 ▴ 80

@deut13-8633

Last seen 4.7 years ago

Germany

Besides removing mean, in some cases it might be useful to use z-scores of the gene profiles in the data matrix:

data.t<-t(data)
data.scaled<-t(scale(data.t))

If you use heatmap.2 you should input scaled data and use scale="none" as the source code shows that data are scaled after clustering for visualization.

ADD COMMENT • link 8.6 years ago deut13 ▴ 80

0

Entering edit mode

I have an argument for why not to use gene z-scores in a post below.

ADD REPLY • link 8.6 years ago Michael Love 41k

0

Entering edit mode

It is a good argument. I was thinking of finding genes with similar 'shape', so that in a case like

mat <- matrix(c(1,1,1,10,10,10,10,10,10,100,100,100,rep(1:2,6)),ncol=6,byrow=TRUE) one would separate genes 1-2 from 3-4.

ADD REPLY • link 8.6 years ago deut13 ▴ 80

0

Entering edit mode

ah I see. Yes if you are only concerned in comparing genes, then this could be useful.

ADD REPLY • link 8.6 years ago Michael Love 41k