scaling data for clustering heat map
@beginner-15939
Switzerland

I have RNA-seq and Microarray data. 

For the RNA-seq data, I worked through the edgeR tutorial and am using it for differential expression analysis. One step in the tutorial scales each gene with logCPM <- t(scale(t(logCPM))). Is this Z-scale data? Is there any difference between Z-scale and Z-score?

I'm completely new to microarray gene expression data analysis. I have a matrix with rows as genes and columns as samples with gene intensities. I would like to know which package to use for normalisation and how to proceed further to make a clustering heat map with that?

Thank you

@james-w-macdonald-5106
United States

I have never heard of Z-scale data, so can't speak to that. But a z-score is computed by centering and scaling the data by the standard deviation, which is what the scale function does, so long as you use the defaults.
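To make the point about scale concrete, here is a minimal sketch in base R. The matrix is simulated and the object names are only for illustration; it just checks that t(scale(t(x))) with the default arguments gives the same result as computing per-gene z-scores by hand.

```r
# Simulated log-expression matrix: 4 genes (rows) x 6 samples (columns).
# The values are made up purely for illustration.
set.seed(1)
logCPM <- matrix(rnorm(24, mean = 6, sd = 2), nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("sample", 1:6)))

# scale() works on columns, so transpose to put genes in columns,
# standardize, then transpose back to genes x samples.
z <- t(scale(t(logCPM)))

# The same thing written out by hand: (x - row mean) / row standard deviation.
z_manual <- (logCPM - rowMeans(logCPM)) / apply(logCPM, 1, sd)

# Each gene now has mean 0 and standard deviation 1 across samples.
round(rowMeans(z), 10)
apply(z, 1, sd)

# The two computations agree (ignoring the extra attributes scale() attaches).
all.equal(z, z_manual, check.attributes = FALSE)
```

Note that the defaults matter: with center = FALSE or scale = FALSE the result is no longer a z-score.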

As for generating a heatmap, the ComplexHeatmap package is a reasonable choice. How you process your data prior to generating a heatmap is dependent on how you want the resulting colors to be interpreted. To generate the colors for the heatmap, by definition the median of the values is set to be the 'middle' of the data, and the colors represent how much the data diverges from the median value.

If you use logCPM, then the median value will be somewhere around say 6-7, and the colors will represent how much a particular gene, in a particular sample diverges from that value. So the apparently down-regulated genes will be down-regulated as compared to the median of all the genes (and the same for the up-regulated genes). This may or may not be useful. But the differences can be interpreted as log fold changes, so that part is OK.

If you use z-scores, then the colors represent the divergence of a particular gene in a particular sample from the mean value for that gene over all samples, in units of standard deviations. So you can readily see which samples are going up or down for each gene, but it's harder to interpret how much they are changing, because that depends on how large the standard deviation for that gene is.
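A small made-up example may help show why the size of the change is hard to read from z-scores. The two genes below are hypothetical: geneB has the same pattern as geneA but three times the amplitude on the log scale.

```r
# Two hypothetical genes measured in 6 samples (3 controls, then 3 treated).
# geneB has the same up/down pattern as geneA but three times the amplitude.
geneA <- c(5.0, 5.2, 4.8, 6.0, 6.2, 5.8)
geneB <- 3 * geneA - 5
x <- rbind(geneA = geneA, geneB = geneB)

# Centering only: values are still log-scale differences from each gene's mean.
centered <- x - rowMeans(x)

# Z-scores: centered values divided by each gene's standard deviation.
z <- t(scale(t(x)))

# The centered values differ threefold between the two genes...
centered["geneB", ] / centered["geneA", ]

# ...but the z-scores are identical, so the two heatmap rows would look the same.
all.equal(z["geneA", ], z["geneB", ])
```

So a z-score heatmap shows the pattern across samples clearly, while a heatmap of centered log values keeps the fold-change information.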

 


Thanks for the information, James. I saw people talking about z-scale somewhere. Anyway, the function I mentioned gives z-scores. Could you please tell me how to proceed with the microarray data? Which package should I use for normalisation and for calculating z-scores?


It depends on the data. You just say you have a matrix. Certainly limma can handle that, but how you normalize is dependent on the type of data, and whether or not it has already been normalized.


Thank you for the information

@gordon-smyth
WEHI, Melbourne, Australia

The term "z scale" is seldom used in statistics but, when it is, it means the same as "z score".

In the edgeR tutorial, you can also use

coolmap(logCPM)

which does most of the work for you. It will automatically standardize the expression values, so you don't need to do that yourself beforehand (as is done in the edgeR tutorial).

It is the same with microarrays. You can use

coolmap(x)

where x is the data object of normalized log-expression values.

The ComplexHeatmap package is a good choice if you are an expert, but choosing clustering metrics for microarray and RNA-seq data is deceptively tricky. People often do it incorrectly. We created the coolmap function so that the default settings are appropriate for expression data.
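As one illustration of why the choice of metric interacts with standardization, the base R sketch below uses simulated data to check a general identity: once each gene is converted to z-scores, the squared Euclidean distance between two genes equals 2(n - 1)(1 - r), where r is their Pearson correlation and n is the number of samples. This is not coolmap's code, and the complete linkage used at the end is just an assumption for the example.

```r
# Simulated expression matrix: 5 genes x 8 samples (values are made up).
set.seed(42)
x <- matrix(rnorm(5 * 8), nrow = 5,
            dimnames = list(paste0("gene", 1:5), paste0("s", 1:8)))
n <- ncol(x)

# Row-wise z-scores, as in t(scale(t(logCPM))).
z <- t(scale(t(x)))

# Squared Euclidean distances between the standardized genes.
d2 <- as.matrix(dist(z))^2

# Pearson correlations between genes, computed on the unscaled data.
r <- cor(t(x))

# Identity: squared distance on z-scores = 2 * (n - 1) * (1 - correlation).
all.equal(d2, 2 * (n - 1) * (1 - r), check.attributes = FALSE)

# Hierarchical clustering of genes on that distance (linkage chosen arbitrarily).
hc <- hclust(dist(z), method = "complete")
hc$order
```

In other words, clustering standardized genes by Euclidean distance groups them by expression pattern rather than by overall expression level, which is usually what a heatmap of differentially expressed genes is meant to show.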


This is very useful information. Thanks a lot Gordon.
