Question

Building a pseudo-network based on correlations

ssabri ▴ 20

@ssabri-9464

I have clustered a few points and computed the mean of each cluster as a landmark point for that cluster. I then computed a correlation matrix among all landmark points to see which are most similar. Now I'd like to *connect* each landmark point to its two most similar neighbors. Since the landmark points do not have X,Y coordinates on the clustering map, I am using the centroid of each cluster as the starting point for connecting landmarks.

My `assignments` data.frame looks something like this:

    > head(assignments)
                      Transcripts Genes Timepoint Run Cluster       V1          V2              Cell    meanX      meanY
    8A_0_AATCTGCACCAA      143327 10542     Day 0  8A       6 113.8933  -2.1280855 8A_0_AATCTGCACCAA 124.3976  -8.682189
    8A_0_CATGTCCTATCT      117322 10334     Day 0  8A       6 110.0499  -2.1553971 8A_0_CATGTCCTATCT 124.3976  -8.682189
    8A_0_ATGCTCAATTGG      102764  9974     Day 0  8A       6 104.7227  -0.8397611 8A_0_ATGCTCAATTGG 124.3976  -8.682189
    8A_0_CTACGGGAGAGT       92832  9651     Day 0  8A       6 101.3370  -5.0928108 8A_0_CTACGGGAGAGT 124.3976  -8.682189
    8A_0_GTAGGGCGCGCT       90264  8807     Day 0  8A       6 113.3947 -18.9441484 8A_0_GTAGGGCGCGCT 124.3976  -8.682189
    8A_0_ACGAGCTAACGG       83663  9148     Day 0  8A       7 114.6545 -31.6095622 8A_0_ACGAGCTAACGG 113.3952 -38.072025

.. and is used to generate the plot below:

    ggplot(assignments, aes(V1, V2)) +
      geom_point(aes(colour=Cluster)) +
      geom_text(aes(meanX, meanY, label=Cluster), hjust=0.5, vjust=0.5, color='black', size=10)

Now, given the landmark correlation matrix shown below, I'd like to connect each centroid point to its two nearest/most correlated neighbors.

    > correlations
               6         7         8         9        10         1         2         4         3         5
    6  1.0000000 0.9659331 0.9493777 0.9242049 0.8824548 0.5397928 0.8931104 0.9017541 0.8970069 0.7533944
    7  0.9659331 1.0000000 0.9135727 0.9470702 0.9150341 0.4877232 0.8631777 0.8704699 0.8943308 0.8030752
    8  0.9493777 0.9135727 1.0000000 0.9579769 0.9186628 0.6490311 0.9633100 0.9537481 0.9353821 0.7405100
    9  0.9242049 0.9470702 0.9579769 1.0000000 0.9566888 0.5907496 0.9251592 0.9224040 0.9222353 0.7897998
    10 0.8824548 0.9150341 0.9186628 0.9566888 1.0000000 0.6575404 0.9384422 0.9444283 0.9417460 0.8206313
    1  0.5397928 0.4877232 0.6490311 0.5907496 0.6575404 1.0000000 0.7263267 0.7691349 0.5890844 0.3941064
    2  0.8931104 0.8631777 0.9633100 0.9251592 0.9384422 0.7263267 1.0000000 0.9693034 0.9484506 0.7347418
    4  0.9017541 0.8704699 0.9537481 0.9224040 0.9444283 0.7691349 0.9693034 1.0000000 0.9168149 0.7184342
    3  0.8970069 0.8943308 0.9353821 0.9222353 0.9417460 0.5890844 0.9484506 0.9168149 1.0000000 0.8427517
    5  0.7533944 0.8030752 0.7405100 0.7897998 0.8206313 0.3941064 0.7347418 0.7184342 0.8427517 1.0000000

The resulting plot should look similar to the plot above, but with an overlay of a sort of network in which each centroid is connected to its 2 most similar neighbors/centroids. Any help would be greatly appreciated!

**EDIT**:

I should mention that each landmark cell used to produce the correlation matrix is simply an average of the underlying data for the cells within the designated cluster:

    # compute `landmark cell` for each cluster
    data = cbind(assignments, t(dge[,assignments$Cell]))
    cluster.gene.avg.list = list()
    for(n in unique(data$Cluster)) {
      # gene columns start at column 11, after the 10 metadata columns of `assignments`
      temp.cluster = subset(data, Cluster==n)[,11:ncol(data)]
      # per-gene mean across the cells of this cluster
      cluster.gene.avg.list[[n]] = rowMeans(t(temp.cluster))
    }
    landmark = do.call(cbind, cluster.gene.avg.list)

.. where `dge` is a matrix of gene expression values with dimensions 16015 by 2449:

    > head(dge[,1:5])
                  8A_3_GACACGTAGGCC 8A_3_TTACAAATGTCA 8A_3_GCTCAAATCTTC 8A_7_CCGCCCCGACTT 8A_0_AATCTGCACCAA
    0610005C13RIK        0.00000000        0.00000000        0.09081976        0.00000000         0.0000000
    0610007P14RIK        0.34322315        0.39803339        0.72224870        0.80916196         0.3551089
    0610009B22RIK        0.07548816        0.25172063        0.17625931        0.18493077         0.4317327
    0610009L18RIK        0.00000000        0.17259527        0.09081976        0.00000000         0.0000000
    0610009O20RIK        0.00000000        0.08887713        0.09081976        0.09542651         0.0000000
    0610010B08RIK        0.56896378        0.91807267        0.83163550        0.86439381         0.763586

R ggplot2 • 949 views

updated 7.5 years ago by Matthias Z. ▴ 20 • written 7.5 years ago by ssabri ▴ 20

score 0 · Answer 1 · 2016-10-27

Your specific question is not clear to me, so I presume it is how to display data you have already generated, and not how to calculate which two neighbors each centroid should be connected to.
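In case the neighbor calculation is also needed, a minimal base R sketch could look like the one below. It assumes a symmetric correlation matrix with cluster labels as row and column names, as in the question. The helper name `top_neighbors` and the resulting `edges` data frame are made up for illustration, and the toy matrix only contains the values for clusters 6 to 9 copied from the question.

```r
# Toy stand-in for `correlations`: values for clusters 6-9 copied from the question
cl <- c("6", "7", "8", "9")
correlations <- matrix(
  c(1.0000000, 0.9659331, 0.9493777, 0.9242049,
    0.9659331, 1.0000000, 0.9135727, 0.9470702,
    0.9493777, 0.9135727, 1.0000000, 0.9579769,
    0.9242049, 0.9470702, 0.9579769, 1.0000000),
  nrow = 4, byrow = TRUE, dimnames = list(cl, cl)
)

# Hypothetical helper: for each cluster, keep the k most correlated other clusters
top_neighbors <- function(cor_mat, k = 2) {
  edges <- lapply(rownames(cor_mat), function(from) {
    r <- cor_mat[from, colnames(cor_mat) != from]        # drop the self-correlation
    best <- names(sort(r, decreasing = TRUE))[seq_len(k)]
    data.frame(from = from, to = best, correlation = unname(r[best]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, edges)
}

# One row per connection: columns from, to, correlation
edges <- top_neighbors(correlations, k = 2)
```

Note that "two most similar neighbors" is not a symmetric relation: a cluster can be among the top two of another cluster without the reverse being true, so some centroids may end up with more than two lines attached.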

The suitable ggplot2 geoms for plotting connections are geom_segment() or geom_curve(), depending on whether you would like to display straight lines or curves.
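As a rough sketch of the geom_segment() route: the coordinates of each connection's start and end point are looked up in a table of centroids and passed to the geom as a separate data frame. The data frames `centroids`, `edges` and `segments` are names invented for this example. The centroids of clusters 6 and 7 are taken from the `meanX`/`meanY` columns in the question, while the coordinates of cluster 8 and the edge list itself are made up.

```r
library(ggplot2)

# Centroids: clusters 6 and 7 use meanX/meanY from the question;
# the coordinates for cluster 8 are invented for illustration.
centroids <- data.frame(
  Cluster = c("6", "7", "8"),
  meanX   = c(124.3976, 113.3952, 90.0),
  meanY   = c(-8.682189, -38.072025, -20.0),
  stringsAsFactors = FALSE
)

# Hypothetical edge list: one row per connection to draw
edges <- data.frame(from = c("6", "6", "7"), to = c("7", "8", "8"),
                    stringsAsFactors = FALSE)

# Look up the start and end coordinates of every edge
i <- match(edges$from, centroids$Cluster)
j <- match(edges$to, centroids$Cluster)
segments <- data.frame(x = centroids$meanX[i], y = centroids$meanY[i],
                       xend = centroids$meanX[j], yend = centroids$meanY[j])

# Draw the connections first so the cluster labels sit on top of them
p <- ggplot(centroids, aes(meanX, meanY)) +
  geom_segment(data = segments, aes(x = x, y = y, xend = xend, yend = yend),
               inherit.aes = FALSE, colour = "grey40") +
  geom_text(aes(label = Cluster), size = 10)
```

The same geom_segment() layer, with `inherit.aes = FALSE`, can be added to the existing scatter plot of `V1` against `V2`, since the centroids share that coordinate system. Swapping in geom_curve() gives curved connections.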

In general, however, I would recommend investing some more time in learning and using one of the packages that were written specifically for networks, in particular ggnet2 or geomnet. HiveR might also be worth a look: hive plots seem to me to be exactly the type of plot you are looking for to represent your data nicely.

Best

Matthias