Question: Benefits of Pearson distance for hierarchical clustering
Asked 11 months ago by Ahdee (United States):

Dear all, I noticed recently that the heatmap3 package uses Pearson distance instead of the default Euclidean/Manhattan distance.

as.dist(1 - cor(df, use = "pairwise.complete.obs"))

 

Is there a benefit to using correlation instead of Euclidean distance when calculating distances? I ask because the heatmap generated this way (for a gene expression matrix) looks much better, and I can see a clear expression pattern in some clusters. I would like to study a few of the groups in the tree to see if there are any trends. However, I usually do this when the distances were computed with the Euclidean method, and I'm not sure whether I can do the same with correlation distance. Any suggestions?

 

Thanks!

Tags: heatmap, heatmap.3
Answer: Benefits of Pearson distance for hierarchical clustering
Answered 11 months ago by James W. MacDonald (United States):

You can make an argument that correlation (or 1 - cor) is a better distance measure for a couple of reasons. First, the intensity and range of gene expression measures depend to some extent on either transcript length (for RNA-Seq) or GC content (for microarrays). If you use Euclidean distance, the clustering may be dominated by genes with large changes in measured expression, which may have more to do with technical aspects of the measurement than with changes in the underlying expression levels.
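To make the first point concrete, here is a minimal sketch with a made-up three-gene matrix (the gene names and all values are invented for illustration, not taken from any real data set). geneA and geneB follow the same pattern across samples but differ in level and amplitude, while geneC has the opposite pattern at the same level as geneA.

```r
# Toy log-scale expression matrix: rows are genes, columns are samples.
# All values are hypothetical and chosen only to illustrate the point.
pattern <- c(-1, -0.5, 0, 0.5, 1)
expr <- rbind(
  geneA = 4 + 1 * pattern,   # small-amplitude version of the pattern
  geneB = 10 + 3 * pattern,  # same pattern, higher level, larger amplitude
  geneC = 4 - 1 * pattern    # opposite pattern, same level as geneA
)

# Euclidean distance between genes (dist() compares rows)
eu <- as.matrix(dist(expr, method = "euclidean"))

# Pearson correlation distance between genes; cor() compares columns,
# so the matrix is transposed first
pd <- 1 - cor(t(expr))

eu["geneA", c("geneB", "geneC")]  # geneC is the closer gene here
pd["geneA", c("geneB", "geneC")]  # geneB is the closer gene here
```

With these numbers, Euclidean distance puts geneA nearer to geneC (similar magnitude, opposite trend), while correlation distance puts it nearer to geneB (same trend, different magnitude). Wrapping `pd` in `as.dist()` gives an object that `hclust()` accepts.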

Second, a difference of 1 means really different things, depending on the underlying values. If you have an expression value of 1 vs 2, that is far less meaningful than a difference of 1 between expression values of, say, 8 vs 9. Remember, your data should be logged (base 2 here), so 1 vs 2 is 2 vs 4 in linear terms. But 8 vs 9 is the difference between 256 and 512 in linear terms, and as such is a more believable change. The correlation for low-expressing genes will probably be really poor, but it will get better (if there really is something consistent between samples) as the expression values get larger, so your correlation distance may be based on more believable differences between samples.
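The arithmetic in that paragraph can be checked directly. This short sketch assumes the values are on a log2 scale, which is what the 8 to 256 conversion implies.

```r
# Log2-scale expression values from the example above
log_vals <- c(1, 2, 8, 9)

# Back-transform to the linear scale
linear_vals <- 2^log_vals
linear_vals

# The same one-unit difference on the log scale corresponds to very
# different absolute differences on the linear scale
low_diff  <- linear_vals[2] - linear_vals[1]  # 1 vs 2 on the log scale
high_diff <- linear_vals[4] - linear_vals[3]  # 8 vs 9 on the log scale
c(low = low_diff, high = high_diff)
```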

That said, clustering isn't an inferential method, and it's difficult to say if a given heatmap is better in some sense than another. Certainly one might look better, but I'm not sure that's a criterion you should really trust.
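On the practical side of the original question, groups can be pulled out of the tree in the same way whichever distance was used, since `hclust()` and `cutree()` only need a `dist` object. The sketch below uses simulated data (random numbers, an arbitrary seed, and an arbitrary choice of four groups, all assumptions for illustration) and cross-tabulates the groups from the two distances, which is a way to see how much they disagree rather than a way to decide which is better.

```r
set.seed(1)  # arbitrary seed, only so the toy example is reproducible

# Simulated matrix: 40 hypothetical genes (rows) by 6 samples (columns)
expr <- matrix(rnorm(40 * 6), nrow = 40,
               dimnames = list(paste0("gene", 1:40), paste0("s", 1:6)))

# Hierarchical clustering of genes under each distance
hc_eu  <- hclust(dist(expr))                 # Euclidean
hc_cor <- hclust(as.dist(1 - cor(t(expr))))  # Pearson correlation distance

k <- 4  # number of groups to cut into; an arbitrary choice
grp_eu  <- cutree(hc_eu,  k = k)
grp_cor <- cutree(hc_cor, k = k)

# Cross-tabulate group membership under the two distances
tab <- table(euclidean = grp_eu, correlation = grp_cor)
tab

# Genes in the first correlation-based group, for follow-up
names(grp_cor)[grp_cor == 1]
```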
