Question

scaling data for clustering heat map

I have RNA-seq and Microarray data.

For the RNA-seq data I went through the edgeR tutorial and am using it for differential expression analysis. The tutorial has a step that scales each gene with logCPM <- t(scale(t(logCPM))). Is this Z-scale data? Is there any difference between Z-scale and Z-score?

I'm completely new to microarray gene expression data analysis. I have a matrix of gene intensities with genes as rows and samples as columns. Which package should I use for normalisation, and how do I proceed from there to make a clustering heat map?

Thank you

edger heatmap clustering differential gene expression rnaseq • 6.6k views

updated 6.7 years ago by Gordon Smyth 52k • written 6.7 years ago by Beginner ▴ 60

score 3 · Answer 1 · 2018-05-31

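I have never heard of Z-scale data, so I can't speak to that. But a z-score is computed by centering the data on the mean and then dividing by the standard deviation, which is what the scale function does, so long as you use the defaults. Because scale works on columns, the tutorial transposes the matrix before and after the call so that each gene (row) is standardized.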
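To make that concrete, here is a minimal sketch in base R, using a made-up toy matrix (the values and the gene and sample names are purely illustrative). It checks that the tutorial's t(scale(t(logCPM))) gives the same result as subtracting each gene's mean and dividing by that gene's standard deviation by hand.

```r
# Toy log-expression matrix: 3 genes (rows) x 4 samples (columns).
# The numbers are invented for illustration only.
logCPM <- matrix(c( 5, 6,  7,  8,
                    2, 2,  4,  4,
                   10, 9, 10, 11),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), paste0("s", 1:4)))

# scale() standardizes columns, so transpose to put genes in columns,
# standardize, then transpose back to genes-by-samples.
z <- t(scale(t(logCPM)))

# The same z-scores computed by hand: (value - gene mean) / gene SD.
z_manual <- (logCPM - rowMeans(logCPM)) / apply(logCPM, 1, sd)

# Should be TRUE: both routes give the same numbers.
same <- isTRUE(all.equal(z, z_manual, check.attributes = FALSE))
print(same)
```

After this step every gene has mean 0 and standard deviation 1 across samples, which is all that "z-score" means here.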

As for generating a heatmap, the ComplexHeatmap package is a reasonable choice. How you process your data prior to generating a heatmap depends on how you want the resulting colors to be interpreted. In a typical diverging color scheme, a central value (here taken to be the median of the values) is mapped to the 'middle' color, and the colors represent how far each value diverges from that center.

If you use logCPM, then the median value will be somewhere around, say, 6-7, and the colors will represent how much a particular gene in a particular sample diverges from that value. So the apparently down-regulated genes will be down-regulated compared to the median of all the genes (and likewise for the up-regulated genes). This may or may not be useful. But the differences can be interpreted as log fold changes, so that part is OK.

If you use z-scores, then the colors represent the divergence of a particular gene in a particular sample from the mean value for that gene over all samples, in units of standard deviations. So you can readily see which samples are going up or down for each gene, but it's harder to interpret how much they are changing, because that depends on how large the standard deviation for that gene is.
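A small sketch of that last point, again in base R with invented numbers: two hypothetical genes follow the same up/down pattern across four samples, but one changes by about 0.2 on the log scale and the other by about 4. After row-wise z-scoring they become indistinguishable, so a z-score heatmap would color them identically.

```r
# Two made-up genes with the same pattern but very different magnitudes.
small_change <- c(5.0, 5.0, 5.2, 5.2)
large_change <- c(5.0, 5.0, 9.0, 9.0)
x <- rbind(small_change, large_change)

# Row-wise z-scores, as in the edgeR tutorial step.
z <- t(scale(t(x)))

# The two rows of z-scores agree (up to floating-point error).
rows_match <- isTRUE(all.equal(z[1, ], z[2, ], check.attributes = FALSE))
print(rows_match)

# The log fold changes (samples 3-4 versus samples 1-2) remain very different.
lfc <- rowMeans(x[, 3:4]) - rowMeans(x[, 1:2])
print(lfc)
```

So a z-score heatmap is good for seeing the pattern of change, while the unscaled (or only mean-centered) values are what you need if the size of the change matters.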

score 1 · Answer 2 · 2018-05-31

The term "z scale" is seldom used in statistics but, when it is, it means the same as "z score".

In the edgeR tutorial, you can also use

coolmap(logCPM)

which does most of the work for you. It will automatically standardize the expression values, so you don't need to do that yourself beforehand (as is done in the edgeR tutorial).

It is the same with microarrays. You can use

coolmap(x)

where x is the data object of normalized log-expression values.

The ComplexHeatmap package is a good choice if you are an expert, but choosing clustering metrics for microarray and RNA-seq data is deceptively tricky. People often do it incorrectly. We created the coolmap function so that the default settings are appropriate for expression data.
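As one illustration of why the metric choice is subtle, here is a base-R sketch on simulated data (the matrix and names are invented). It is not the coolmap implementation, only a check of a general fact: once genes are row-standardized, Euclidean distance between two genes is a fixed function of their Pearson correlation, sqrt(2 * (n - 1) * (1 - r)) for n samples. Clustering z-scores by Euclidean distance is therefore effectively clustering by correlation pattern, whereas Euclidean distance on unscaled values mostly reflects overall expression level.

```r
set.seed(1)

# Simulated log-expression values: 6 genes x 5 samples (illustrative only).
x <- matrix(rnorm(6 * 5, mean = 7), nrow = 6,
            dimnames = list(paste0("gene", 1:6), paste0("s", 1:5)))
n <- ncol(x)

# Row-standardize, then take Euclidean distances between genes.
z <- t(scale(t(x)))
d_euclid <- as.matrix(dist(z))

# Pearson correlation between genes, converted with the identity above.
# pmax() guards against tiny negative values from rounding.
r <- cor(t(x))
d_from_cor <- sqrt(2 * (n - 1) * pmax(1 - r, 0))

# Should be TRUE: the two distance matrices coincide.
metrics_agree <- isTRUE(all.equal(d_euclid, d_from_cor, check.attributes = FALSE))
print(metrics_agree)

# Hierarchical clustering of genes on a correlation-based distance.
hc <- hclust(as.dist(1 - r), method = "complete")
print(hc$order)
```

Getting these pieces consistent by hand (standardization, distance, linkage) is the part that a function with expression-oriented defaults takes care of.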