Question

benefits of pearson distance for hierarchical clustering

Ahdee ▴ 50

@ahdee-8938

United States

Dear all, I noticed recently that the heatmap3 package uses Pearson distance instead of the default Euclidean/Manhattan.

as.dist(1 - cor(df, use = "pa"))

Is there a benefit to using correlation instead of Euclidean distance? I ask because the heatmap generated (for a gene expression matrix) actually looks much better, and I can see a nice expression pattern between some clusters. I would like to study a few of the groups in the tree to see if there are any trends. However, I usually do this when the distance was computed with Euclidean, and I'm not sure whether I can do the same with this method. Any suggestions?

thanks!

heatmap.3 heatmap • 2.4k views

updated 5.4 years ago by James W. MacDonald 65k • written 5.4 years ago by Ahdee ▴ 50

score 4 · Accepted Answer · 2018-12-10

You can make an argument that correlation (or 1-cor) is a better distance measure for a couple of reasons. First, the intensity and range of gene expression measures are to a certain extent dependent on either the length (for RNA-Seq) or the GC content (for microarrays). If you use Euclidean distance, the result may be dominated by genes with large changes in measured expression, which may have more to do with technical aspects of the measurement than with changes in the underlying expression levels.
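To make the first point concrete, here is a minimal sketch on made-up data (the gene names and values are hypothetical, not from any real experiment). geneA and geneB follow the same pattern but with different amplitudes, while geneC follows the opposite pattern at an amplitude similar to geneA. Note that `cor()` works on columns, so the matrix is transposed to compare rows.

```r
# Toy log-scale expression matrix: rows are genes, columns are samples.
# All values are invented for illustration only.
pattern <- c(0, 1, 0, 1, 0, 1)
expr <- rbind(
  geneA = 5 + pattern,        # baseline pattern
  geneB = 5 + 4 * pattern,    # same pattern, four times the amplitude
  geneC = 5 + (1 - pattern)   # opposite pattern, same amplitude as geneA
)

# Euclidean distance between rows (genes).
euclid <- dist(expr, method = "euclidean")

# Pearson distance between rows; cor() correlates columns, hence t().
pearson <- as.dist(1 - cor(t(expr), use = "pairwise.complete.obs"))

print(round(as.matrix(euclid), 3))
print(round(as.matrix(pearson), 3))
```

In this toy case Euclidean distance puts geneA closer to the anti-correlated geneC than to geneB, simply because geneB has larger values, whereas the correlation distance puts geneA and geneB together.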

Second, a difference of 1 means really different things depending on the underlying values. A difference of 1 between expression values of 1 and 2 is way less meaningful than a difference of 1 between expression values of, say, 8 and 9. Remember, your data should be logged (log2 here), so 1 vs 2 is 2 vs 4 in linear terms, whereas 8 vs 9 is 256 vs 512 in linear terms, and as such is a more believable change. The correlation for low-expressing genes will probably be really poor, but will get better (if there really is something consistent between samples) as the expression values get larger, so your correlation distance may be based on more believable differences between samples.
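The arithmetic behind that comparison, as a quick sketch assuming the data are on a log2 scale (which is what the numbers above imply):

```r
# Expression values on the log2 scale, as in the example above.
log2_values <- c(1, 2, 8, 9)

# Back-transform to the linear scale.
linear_values <- 2^log2_values

# The same log2 difference of 1 corresponds to very different linear changes.
low_diff  <- linear_values[2] - linear_values[1]   # 1 vs 2 on the log2 scale
high_diff <- linear_values[4] - linear_values[3]   # 8 vs 9 on the log2 scale

print(linear_values)
print(c(low = low_diff, high = high_diff))
```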

That said, clustering isn't an inferential method, and it's difficult to say if a given heatmap is better in some sense than another. Certainly one might look better, but I'm not sure that's a criterion you should really trust.
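As for pulling groups out of the tree: `hclust()` accepts any `dist` object, so the tree can be cut the same way whichever distance was used. A minimal sketch on simulated data follows; the matrix, the average linkage, and the choice of `k = 4` are all assumptions made for illustration, not what heatmap3 does internally.

```r
set.seed(1)

# Hypothetical log-scale matrix: 20 genes (rows) by 8 samples (columns).
expr <- matrix(
  rnorm(20 * 8), nrow = 20,
  dimnames = list(paste0("gene", 1:20), paste0("sample", 1:8))
)

# Same distance as in the question, applied to genes.
# cor() correlates columns, so transpose to compare rows.
gene_dist <- as.dist(1 - cor(t(expr), use = "pairwise.complete.obs"))

# Build the tree and cut it into k groups (linkage and k chosen arbitrarily).
tree <- hclust(gene_dist, method = "average")
groups <- cutree(tree, k = 4)

# List the genes that fall into each group.
print(split(names(groups), groups))
```

The groups you get are still only descriptive, so any trend you see in them needs to be checked by other means.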