I have clustered a set of points and computed the mean of each cluster as a landmark point for that cluster. I then computed a correlation matrix among all landmark points to see which are most similar. Now I'd like to *connect* each landmark point to its two most similar neighbors. Since the landmark points do not have X,Y coordinates on the clustering map, I am using the centroid of each cluster as the starting point for connecting landmarks.
My `assignments` data.frame looks something like this:

    > head(assignments)
                      Transcripts Genes Timepoint Run Cluster       V1          V2              Cell    meanX      meanY
    8A_0_AATCTGCACCAA      143327 10542     Day 0  8A       6 113.8933  -2.1280855 8A_0_AATCTGCACCAA 124.3976  -8.682189
    8A_0_CATGTCCTATCT      117322 10334     Day 0  8A       6 110.0499  -2.1553971 8A_0_CATGTCCTATCT 124.3976  -8.682189
    8A_0_ATGCTCAATTGG      102764  9974     Day 0  8A       6 104.7227  -0.8397611 8A_0_ATGCTCAATTGG 124.3976  -8.682189
    8A_0_CTACGGGAGAGT       92832  9651     Day 0  8A       6 101.3370  -5.0928108 8A_0_CTACGGGAGAGT 124.3976  -8.682189
    8A_0_GTAGGGCGCGCT       90264  8807     Day 0  8A       6 113.3947 -18.9441484 8A_0_GTAGGGCGCGCT 124.3976  -8.682189
    8A_0_ACGAGCTAACGG       83663  9148     Day 0  8A       7 114.6545 -31.6095622 8A_0_ACGAGCTAACGG 113.3952 -38.072025

.. and is used to generate the plot below:

    ggplot(assignments, aes(V1, V2)) +
      geom_point(aes(colour=Cluster)) +
      geom_text(aes(meanX, meanY, label=Cluster), hjust=0.5, vjust=0.5, color='black', size=10)

Now, given the landmark correlation matrix shown below, I'd like to connect each centroid point to its two nearest (most correlated) neighbors.

    > correlations
               6         7         8         9        10         1         2         4         3         5
    6  1.0000000 0.9659331 0.9493777 0.9242049 0.8824548 0.5397928 0.8931104 0.9017541 0.8970069 0.7533944
    7  0.9659331 1.0000000 0.9135727 0.9470702 0.9150341 0.4877232 0.8631777 0.8704699 0.8943308 0.8030752
    8  0.9493777 0.9135727 1.0000000 0.9579769 0.9186628 0.6490311 0.9633100 0.9537481 0.9353821 0.7405100
    9  0.9242049 0.9470702 0.9579769 1.0000000 0.9566888 0.5907496 0.9251592 0.9224040 0.9222353 0.7897998
    10 0.8824548 0.9150341 0.9186628 0.9566888 1.0000000 0.6575404 0.9384422 0.9444283 0.9417460 0.8206313
    1  0.5397928 0.4877232 0.6490311 0.5907496 0.6575404 1.0000000 0.7263267 0.7691349 0.5890844 0.3941064
    2  0.8931104 0.8631777 0.9633100 0.9251592 0.9384422 0.7263267 1.0000000 0.9693034 0.9484506 0.7347418
    4  0.9017541 0.8704699 0.9537481 0.9224040 0.9444283 0.7691349 0.9693034 1.0000000 0.9168149 0.7184342
    3  0.8970069 0.8943308 0.9353821 0.9222353 0.9417460 0.5890844 0.9484506 0.9168149 1.0000000 0.8427517
    5  0.7533944 0.8030752 0.7405100 0.7897998 0.8206313 0.3941064 0.7347418 0.7184342 0.8427517 1.0000000

The resulting plot should look similar to the plot above, but with a network overlay in which each centroid is connected to its two most similar neighboring centroids. Any help would be greatly appreciated!
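To make the goal concrete, here is a minimal sketch of one way the overlay might be built. It is not a verified solution for the full data. The `correlations` matrix and `centroids` data frame below are small toy stand-ins (values rounded from the output above, with the coordinates for clusters 8 and 1 made up), and the `edges` table and its column names are hypothetical helpers introduced only for illustration.

```r
library(ggplot2)

# Toy stand-ins (assumed, not the real data): a 4 x 4 correlation matrix
# named by cluster, plus one centroid row per cluster.
ids <- c("6", "7", "8", "1")
correlations <- matrix(
  c(1.00, 0.97, 0.95, 0.54,
    0.97, 1.00, 0.91, 0.49,
    0.95, 0.91, 1.00, 0.65,
    0.54, 0.49, 0.65, 1.00),
  nrow = 4, byrow = TRUE, dimnames = list(ids, ids)
)
centroids <- data.frame(
  Cluster = ids,
  meanX   = c(124.4, 113.4, 90.0, 20.0),   # last two values are invented
  meanY   = c(-8.7, -38.1, 10.0, 40.0),    # last two values are invented
  stringsAsFactors = FALSE
)

# For every cluster, keep the k most correlated other clusters.
k <- 2
edges <- do.call(rbind, lapply(rownames(correlations), function(from) {
  r  <- correlations[from, ]
  r  <- r[names(r) != from]                          # drop the self-correlation
  to <- names(sort(r, decreasing = TRUE))[seq_len(k)]
  data.frame(from = from, to = to, r = unname(r[to]),
             stringsAsFactors = FALSE)
}))

# Look up centroid coordinates for both ends of each edge.
i.from <- match(edges$from, centroids$Cluster)
i.to   <- match(edges$to,   centroids$Cluster)
edges$x    <- centroids$meanX[i.from]
edges$y    <- centroids$meanY[i.from]
edges$xend <- centroids$meanX[i.to]
edges$yend <- centroids$meanY[i.to]

# Draw the segments first so the cluster labels sit on top of them.
p <- ggplot(centroids, aes(meanX, meanY)) +
  geom_segment(data = edges,
               aes(x = x, y = y, xend = xend, yend = yend),
               inherit.aes = FALSE, colour = "grey40") +
  geom_text(aes(label = Cluster), size = 10)
print(p)
```

On the real data, the `geom_segment()` layer would presumably be added to the existing `ggplot(assignments, aes(V1, V2))` call, with `centroids` taken from the unique `Cluster`, `meanX`, `meanY` rows of `assignments`. If `Cluster` is stored as a factor or number, it would need converting with `as.character()` so that `match()` lines up with the matrix row names.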
**EDIT**:
I should mention that each landmark cell used to produce the correlation matrix is simply the average of the underlying data for the cells within the designated cluster:

    # compute `landmark cell` for each cluster
    data = cbind(assignments, t(dge[,assignments$Cell]))
    cluster.gene.avg.list = list()
    for(n in unique(data$Cluster)) {
      temp.cluster = subset(data, Cluster==n)[,11:ncol(data)]
      cluster.gene.avg.list[[n]] = rowMeans(t(temp.cluster))
    }
    landmark = do.call(cbind, cluster.gene.avg.list)

.. where `dge` is a matrix of gene expression values with dimensions 16015 by 2449:

    > head(dge[,1:5])
                  8A_3_GACACGTAGGCC 8A_3_TTACAAATGTCA 8A_3_GCTCAAATCTTC 8A_7_CCGCCCCGACTT 8A_0_AATCTGCACCAA
    0610005C13RIK        0.00000000        0.00000000        0.09081976        0.00000000         0.0000000
    0610007P14RIK        0.34322315        0.39803339        0.72224870        0.80916196         0.3551089
    0610009B22RIK        0.07548816        0.25172063        0.17625931        0.18493077         0.4317327
    0610009L18RIK        0.00000000        0.17259527        0.09081976        0.00000000         0.0000000
    0610009O20RIK        0.00000000        0.08887713        0.09081976        0.09542651         0.0000000
    0610010B08RIK        0.56896378        0.91807267        0.83163550        0.86439381          0.763586

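For reference, the landmark averaging described in the edit, followed by the correlation step, can be sketched on toy data as below. The `dge` and `assignments` objects here are small invented stand-ins, `cells.by.cluster` is a hypothetical helper name, and the use of `cor()` with its default Pearson method is an assumption, since the post does not say how the correlation matrix was computed.

```r
# Toy stand-ins (assumed): 3 genes x 6 cells, with a cluster label per cell.
set.seed(1)
dge <- matrix(runif(18), nrow = 3,
              dimnames = list(paste0("gene", 1:3), paste0("cell", 1:6)))
assignments <- data.frame(Cell    = colnames(dge),
                          Cluster = c(6, 6, 7, 7, 8, 8),
                          stringsAsFactors = FALSE)

# Group cell names by cluster; as.character() guards against factor columns,
# which would otherwise index dge by integer code rather than by name.
cells.by.cluster <- split(as.character(assignments$Cell), assignments$Cluster)

# One landmark per cluster: the mean expression of each gene over its cells.
landmark <- sapply(cells.by.cluster, function(cells) {
  rowMeans(dge[, cells, drop = FALSE])
})

# Pairwise correlation between landmark columns (Pearson by default).
correlations <- cor(landmark)
print(round(correlations, 3))
```

This yields a genes-by-clusters `landmark` matrix whose column names are the cluster labels, so the row and column names of `correlations` can be matched directly against `Cluster` in `assignments`.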