Question

Questions regarding hclust parameters and output

Dana Sevak ▴ 10

Dear BioC list members,

I have a few questions regarding the parameters and the output of the hclust function.

I have a set of five profiles comprising measurements of five variables (V1 to V5) in four different conditions (c1 to c4). You could think of these as log fold-changes for five related genes across four conditions, where the values were multiplied by 100 and rounded for simplicity.

The task is to determine how correlated these profiles are and, based on that, to partition them into two groups using hierarchical clustering. For this I used the hclust and cutree functions as follows:

    df = data.frame(rbind(c(-11,   8, -19,  5,  0),
                          c( -2,  22,  14, 21,  7),
                          c(  7, -16, -43, 20, -3),
                          c( -8, -16, -29, 10, -9)))
    colnames(df) = c("V1", "V2", "V3", "V4", "V5")
    rownames(df) = c("c1", "c2", "c3", "c4")

    par(mfrow=c(2,1))
    plot(df[,1], type="l", ylim=range(df))
    points(df[1,1], type="p", pch=49)
    for (i in 2:5) {
      points(df[,i], type="l", col=colors()[15*i])
      points(df[1,i], type="p", pch=48+i)
    }

    cor.df = cor(df, method="pearson")
    dist.df = as.dist(1-cor.df)
    hc.df = hclust(dist.df, method="average")
    hc.df.cl = cutree(hc.df, k=2)
    plot(hc.df)
    hc.df$order
    par(mfrow=c(1,1))

My questions are:

1. Do I need to center the values of df (again, these are logFC values) before computing the correlation coefficient matrix, and if so, with what function (scale(df)?)? Or do I need to use the uncentered correlation coefficient from cor.dist in the bioDist package?

2. Given that I want to cluster by correlation, so that the variables with the most similar profiles are placed together, is method="average" the best choice, or should I use "centroid", "single", or another method instead?

3. Importantly, I want the value returned by hc.df$order to reflect the closeness of the variables in terms of correlation. That is, the rightmost variable in the left (tightest) cluster should be the one most correlated to the leftmost variable of the right cluster. However, hc.df$order seems to indicate that V1 (first in the rightmost cluster) is more correlated to V3 (listed last in the leftmost cluster):

    > hc.df$order
    [1] 5 2 3 1 4
    > hc.df.cl
    V1 V2 V3 V4 V5
     1  2  2  1  2

From the plot of the profiles, however, it seems that the profile of V1 is more correlated to V5. By convention, hclust lists the singlet leftmost, so V5 appears before V2 and V3 in hc.df$order. If keeping the order of correlation is important to me, am I justified in modifying this listing and placing the singlet on the right of the first cluster (that is, outputting the order V2 V3 V5 V1 V4)? Does the height of the node in plot(hc.df) support the claim that V1 is more correlated to V5 than to V3?

Thank you very much in advance for your kind help.

Best regards,
Dana Sevak

PS. I had problems with my email account, and I apologize if you received duplicate versions of this message (although judging from the archive I don't think it went through).
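Questions 1 and 3 can be checked directly on the data in the post. The following is a minimal sketch, not a definitive answer: it rebuilds df as above, compares cor() on raw and scale()d values, and looks up the V1 correlations and the cophenetic (merge-height) distances. The names cor.scaled and coph are illustrative and do not appear in the original code.

```r
# Rebuild the data from the post: rows are conditions, columns are variables.
df <- data.frame(rbind(c(-11,   8, -19,  5,  0),
                       c( -2,  22,  14, 21,  7),
                       c(  7, -16, -43, 20, -3),
                       c( -8, -16, -29, 10, -9)))
colnames(df) <- c("V1", "V2", "V3", "V4", "V5")
rownames(df) <- c("c1", "c2", "c3", "c4")

cor.df <- cor(df, method = "pearson")

# Question 1: Pearson correlation centers and scales each variable
# internally, so scale()-ing the columns first should not change it.
cor.scaled <- cor(scale(df), method = "pearson")  # illustrative name
print(all.equal(cor.df, cor.scaled))

# Question 3: read the correlations of V1 with V3 and V5 straight
# from the correlation matrix instead of inferring them from the plot.
print(round(cor.df["V1", c("V3", "V5")], 3))

# Cophenetic distance = height of the node at which two leaves are
# first joined in the dendrogram.
hc.df <- hclust(as.dist(1 - cor.df), method = "average")
coph <- as.matrix(cophenetic(hc.df))              # illustrative name
print(coph["V1", c("V3", "V5")])
```

Because V1 falls in a different k=2 cluster from both V3 and V5, both cophenetic distances equal the height of the top node, so the node height by itself does not rank V5 against V3 relative to V1; the correlation matrix does. The branches at any node of a dendrogram can also be swapped without changing the clustering, so the leaf order in hc.df$order is one of several equivalent layouts.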

Clustering bioDist • 1.5k views

15.3 years ago Dana Sevak ▴ 10