Dana Sevak
Dear BioC list members,

I have a few questions regarding the parameters and the output of the
hclust function.

I have a set of five profiles consisting of measurements of five
variables (V1 to V5) in four different conditions (c1 to c4). You could
think of these as log fold-changes for five related genes across four
conditions, where the values were multiplied by 100 and rounded for
simplicity.

The task is to determine how correlated these profiles are and, based
on that, to partition them into two groups using hierarchical
clustering. For this I used the hclust and cutree functions as follows:
# Rows are conditions (c1..c4), columns are variables (V1..V5)
df = data.frame(rbind(c(-11, 8, -19, 5, 0),
                      c(-2, 22, 14, 21, 7),
                      c(7, -16, -43, 20, -3),
                      c(-8, -16, -29, 10, -9)))
colnames(df) = c("V1", "V2", "V3", "V4", "V5")
rownames(df) = c("c1", "c2", "c3", "c4")

# Top panel: one line per variable, labelled with its number at c1
par(mfrow=c(2,1))
plot(df[,1], type="l", ylim=range(df))
points(df[1,1], type="p", pch=49)
for (i in 2:5) {
  points(df[,i], type="l", col=colors()[15*i])
  points(df[1,i], type="p", pch=48+i)
}

# Bottom panel: average-linkage clustering on 1 - Pearson correlation
cor.df = cor(df, method="pearson")
dist.df = as.dist(1-cor.df)
hc.df = hclust(dist.df, method="average")
hc.df.cl = cutree(hc.df, k=2)
plot(hc.df)
hc.df$order
par(mfrow=c(1,1))
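To make the numbers behind the plot easier to inspect, here is a minimal sketch that prints the correlation matrix and compares it with the one obtained after passing df through scale(). Pearson's coefficient already subtracts each column mean, so I would expect the two matrices to agree. The name cor.scaled is only illustrative and is not part of the code above.

```r
# Same toy data as above: rows are conditions, columns are variables.
df <- data.frame(rbind(c(-11, 8, -19, 5, 0),
                       c(-2, 22, 14, 21, 7),
                       c(7, -16, -43, 20, -3),
                       c(-8, -16, -29, 10, -9)))
colnames(df) <- c("V1", "V2", "V3", "V4", "V5")
rownames(df) <- c("c1", "c2", "c3", "c4")

# Pearson correlation between columns (variables).
cor.df <- cor(df, method = "pearson")
print(round(cor.df, 2))

# Same computation after centering and scaling each column.
# cor.scaled is an illustrative name (an assumption of this sketch).
cor.scaled <- cor(scale(df), method = "pearson")

# Reports whether column-wise centering changes the correlation matrix.
print(isTRUE(all.equal(cor.df, cor.scaled)))

# The two pairs that question 3 below is concerned with.
print(cor.df["V1", c("V3", "V5")])
```

Note that this only concerns column-wise centering as done by scale(); the uncentered coefficient is a different quantity and is not computed here.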
My questions are:
1. Do I need to center the values of df (again, these are logFC
values) before computing the correlation coefficient matrix, and if so
with what function (scale(df)?), or do I need to use the uncentered
correlation coefficient from cor.dist in the bioDist package?
2. Given that I want to cluster by correlation so that the variables
with the most similar profiles are placed together, is
method="average" the best choice, or should I use "centroid",
"single", or another method instead?
3. Importantly, I want the value returned by hc.df$order to reflect
the closeness of the variables in terms of correlation. That is, the
right-most variable in the left (tightest) cluster should be the one
most correlated with the left-most variable of the right cluster.
However, hc.df$order seems to indicate that V1 (first in the
right-most cluster) is more correlated with V3 (listed last in the
left-most cluster):
> hc.df$order
[1] 5 2 3 1 4
> hc.df.cl
V1 V2 V3 V4 V5
 1  2  2  1  2
However, from the plot of profiles it seems that the profile of V1 is
more correlated with V5. By convention, hclust lists the singlet
left-most, so V5 appears before V2 and V3 in hc.df$order. If keeping
the order of correlation is important to me, am I justified in
modifying this listing and placing the singlet on the right of the
first cluster (that is, outputting the order V2 V3 V5 V1 V4)? Does the
height of the node in plot(hc.df) support the fact that V1 is more
correlated with V5 than with V3?
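For concreteness, the reordering asked about in question 3 could be sketched as follows. This is only an illustration: the weights in wts are hypothetical values chosen to express the order V2 V3 V5 V1 V4, and it relies on reorder() from the stats package, which rotates branches of a dendrogram without changing which leaves join at which height. The sketch also prints the cophenetic distances, i.e. the node heights at which V1 first joins V3 and V5. Since V1 sits in a different top-level cluster from both, the two pairs meet at the same top node, so I would not expect the node height alone to distinguish them.

```r
# Same toy data and clustering as in the post.
df <- data.frame(rbind(c(-11, 8, -19, 5, 0),
                       c(-2, 22, 14, 21, 7),
                       c(7, -16, -43, 20, -3),
                       c(-8, -16, -29, 10, -9)))
colnames(df) <- c("V1", "V2", "V3", "V4", "V5")
hc.df <- hclust(as.dist(1 - cor(df, method = "pearson")), method = "average")

# Cophenetic distance: height of the node at which two leaves first join.
coph <- as.matrix(cophenetic(hc.df))
print(round(coph["V1", c("V3", "V5")], 3))

# Hypothetical leaf weights for V1..V5 (an assumption of this sketch):
# smaller weight means further left within a branch.
wts <- c(4, 1, 2, 5, 3)

# Rotate branches according to the mean weight of each subtree.
dend <- as.dendrogram(hc.df)
dend.re <- reorder(dend, wts, agglo.FUN = mean)

print(labels(dend))     # leaf order as returned by hclust
print(labels(dend.re))  # leaf order after rotation

# Check that rotation left the merge heights untouched.
coph.re <- as.matrix(cophenetic(dend.re))
print(isTRUE(all.equal(coph.re[rownames(coph), colnames(coph)], coph)))
```

Because only the left-to-right drawing order changes, any order obtained this way describes the same tree; whether it is the most meaningful order is exactly what I am asking.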
Thank you very much in advance for your kind help.
Best regards,
Dana Sevak
PS. I had problems with my email account, and I apologize if you
received duplicate versions of this message (although judging from the
archive I don't think it went through).