I have a large matrix of over 100K observations to cluster using hierarchical clustering. Due to the large size, i do not have the computing power to calculate the distance matrix.
To overcome this problem I chose to aggregate my matrix to merge those observations which were identical to reduce my matrix to about 10K observations. I have the frequency for each of the rows in this aggregated matrix. I now need to incorporate this frequency as a weight in my hierarchical clustering.
I want to use hclust in the stats package. From the help information for hclust the arguments are as follows:
hclust(d, method = "complete", members = NULL)
The information for the members argument is:,
NULL or a vector with length size of
d. See the ‘Details’ section. When you look at the details section you get: If
members != NULL, then
d is taken to be a dissimilarity matrix between clusters instead of dissimilarities between singletons and
members gives the number of observations per cluster. This way the hierarchical cluster algorithm can be ‘started in the middle of the dendrogram’, e.g., in order to reconstruct the part of the tree above a cut (see examples). Dissimilarities between clusters can be efficiently computed (i.e., without
hclust itself) only for a limited number of distance/linkage combinations, the simplest one being squared Euclidean distance and centroid linkage. In this case the dissimilarities between the clusters are the squared Euclidean distances between cluster means.
From the above description, i am unsure if i can assign my frequency weights to the members arguments as it is not clear if this is the purpose of this argument. I would like to use it like this:
hclust(d, method = "complete", members = df$freq)
Where df$freq is the frequency of each row in the aggregated matrix. So if a row is duplicated 10 times this value would be 10.
If anyone can help me that would be great or can put me in contact with the developers of hclust as its not clear what this argument is for.