Question: scaling data for clustering heat map
17 months ago by
Beginner50
Beginner50 wrote:

I have RNA-seq and Microarray data.

For the RNA-seq data I went through the edgeR tutorial and am using it for differential analysis. The tutorial has a step that scales each gene with logCPM <- t(scale(t(logCPM))). Is this Z-scale data? Is there any difference between Z-scale and Z-score?

I'm completely new to microarray gene expression data analysis. I have a matrix of gene intensities with genes as rows and samples as columns. Which package should I use for normalisation, and how do I proceed from there to make a clustering heat map?

Thank you

modified 17 months ago by Gordon Smyth39k • written 17 months ago by Beginner50
Answer: scaling data for clustering heat map
17 months ago by
United States
James W. MacDonald51k wrote:

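I have never heard of Z-scale data, so can't speak to that. But a z-score is computed by centering the data on its mean and then dividing by the standard deviation, which is what the scale function does, so long as you use the defaults. Note that scale works on columns, which is why the tutorial transposes the matrix before and after: the result is one set of z-scores per gene.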
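To make that concrete, here is a minimal sketch using a small simulated matrix (the values and the gene/sample names are made up for illustration, not taken from the edgeR tutorial). It checks that t(scale(t(logCPM))) matches z-scores computed by hand for each gene.

```r
set.seed(1)

# Hypothetical logCPM matrix: 5 genes (rows) x 6 samples (columns)
logCPM <- matrix(rnorm(30, mean = 6, sd = 2), nrow = 5,
                 dimnames = list(paste0("gene", 1:5), paste0("sample", 1:6)))

# scale() centers and scales columns, so transpose to put genes in
# columns, scale, then transpose back to genes-by-samples
z <- t(scale(t(logCPM)))

# The same z-scores by hand: subtract each gene's mean and divide by
# that gene's standard deviation
z_manual <- (logCPM - rowMeans(logCPM)) / apply(logCPM, 1, sd)

# The two versions agree (ignoring the extra attributes scale() may add)
print(all.equal(z, z_manual, check.attributes = FALSE))

# Every gene now has mean 0 and standard deviation 1 across samples
print(round(rowMeans(z), 10))
print(apply(z, 1, sd))
```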

As for generating a heatmap, the ComplexHeatmap package is a reasonable choice. How you process your data prior to generating a heatmap is dependent on how you want the resulting colors to be interpreted. To generate the colors for the heatmap, by definition the median of the values is set to be the 'middle' of the data, and the colors represent how much the data diverges from the median value.

If you use logCPM, then the median value will be somewhere around say 6-7, and the colors will represent how much a particular gene, in a particular sample diverges from that value. So the apparently down-regulated genes will be down-regulated as compared to the median of all the genes (and the same for the up-regulated genes). This may or may not be useful. But the differences can be interpreted as log fold changes, so that part is OK.

If you use z-scores, then the colors represent the divergence of a particular gene in a particular sample from the mean value for that gene over all samples, in units of standard deviations. So you can readily see which samples are going up or down for each gene, but it's harder to interpret how much they are changing, because that depends on how large the standard deviation for that gene is.
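A small made-up example of that last point: the two hypothetical genes below have the same up/down pattern across samples, but one changes by about 0.6 log2 units and the other by about 4.8. After row-wise z-scoring they are indistinguishable, so they would get identical colors in a z-score heatmap.

```r
# Two hypothetical genes with the same pattern but different magnitudes
gene_small <- c(5.0, 5.1, 5.0, 5.6, 5.5, 5.6)
gene_large <- 5 + 8 * (gene_small - 5)   # same shape, 8x the spread

x <- rbind(small_change = gene_small, large_change = gene_large)
colnames(x) <- paste0("sample", 1:6)

# Range of each gene on the log2 scale (interpretable as log fold changes)
log_range <- apply(x, 1, function(v) diff(range(v)))
print(log_range)

# Row-wise z-scores, as in the edgeR tutorial
z <- t(scale(t(x)))

# The z-scores of the two genes are the same despite the 8-fold
# difference in log2 range
print(round(z, 3))
print(all.equal(z["small_change", ], z["large_change", ]))
```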

Thanks for the information, James. I saw people talking about z-scale somewhere. Anyway, the function I mentioned produces z-scores. But could you please tell me how to proceed with the microarray data? Which package should be used for normalisation and for calculating z-scores?

It depends on the data. You just say you have a matrix. Certainly limma can handle that, but how you normalize is dependent on the type of data, and whether or not it has already been normalized.
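Purely as an illustration of one possible case: if the matrix held un-normalized single-channel intensities, a between-array quantile normalization with limma might look like the sketch below. The simulated matrix and the choice of method = "quantile" are assumptions for the example, not a recommendation for your data, and if the matrix has already been normalized this step should be skipped.

```r
set.seed(1)

# Hypothetical raw intensity matrix: 100 genes x 6 arrays
raw <- matrix(2^rnorm(600, mean = 8, sd = 2), nrow = 100,
              dimnames = list(paste0("gene", 1:100), paste0("array", 1:6)))
raw[, 3] <- raw[, 3] * 2   # pretend one array is systematically brighter

# Work on the log2 scale
x <- log2(raw)
print(round(apply(x, 2, median), 3))   # array3 stands out before normalization

if (requireNamespace("limma", quietly = TRUE)) {
  # Quantile normalization forces the arrays onto a common distribution
  x_norm <- limma::normalizeBetweenArrays(x, method = "quantile")
  print(round(apply(x_norm, 2, median), 3))   # medians now agree
}
```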

Thank you for the information

Answer: scaling data for clustering heat map
17 months ago by
Gordon Smyth39k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth39k wrote:

The term "z scale" is seldom used in statistics but, when it is, it means the same as "z score".

In the edgeR tutorial, you can also use

coolmap(logCPM)

which does most of the work for you. It will automatically standardize the expression values, so you don't need to do that yourself beforehand (as is done in the edgeR tutorial).

It is the same with microarrays. You can use

coolmap(x)

where x is the data object of normalized log-expression values.

The ComplexHeatmap package is a good choice if you are an expert, but choosing clustering metrics for microarray and RNA-seq data is deceptively tricky. People often do it incorrectly. We created the coolmap function so that the default settings are appropriate for expression data.
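For anyone who wants to try this without their own data, here is a minimal sketch with a simulated matrix of log-expression values (the matrix, its names, and the planted group difference are invented for the example). coolmap is in the limma package and draws the plot with gplots, so the calls are guarded in case either package is not installed.

```r
set.seed(1)

# Hypothetical normalized log-expression matrix: 20 genes x 6 samples
x <- matrix(rnorm(120, mean = 7, sd = 1), nrow = 20,
            dimnames = list(paste0("gene", 1:20), paste0("sample", 1:6)))

# Plant a difference: the first 10 genes are higher in samples 4-6
x[1:10, 4:6] <- x[1:10, 4:6] + 2

if (requireNamespace("limma", quietly = TRUE) &&
    requireNamespace("gplots", quietly = TRUE)) {
  # Default settings: cluster by pattern of relative expression;
  # no need to standardize x beforehand
  limma::coolmap(x)

  # Alternative: cluster by absolute expression level instead
  limma::coolmap(x, cluster.by = "expression level")
}
```

The cluster.by argument is the main choice to make: the default "de pattern" is aimed at relative changes across samples, while "expression level" is aimed at absolute expression. See ?coolmap for the exact distance and scaling used in each case.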