Question: most efficient way to read adjacency matrix to make networkd3 graph
0
chrisclarkson10030 wrote:

I am inexperienced with iteration in R and am hoping to speed up a process as I am implementing some analysis in a website. I realise this may be a Q for Stackoverflow but I couldn't get an answer from them..

I found a very useful tutorial that allows me to iterate through an adjacency matrix (used to make a data frame as input for networkd3), pick out the criteria data above a certain threshold (>0.01) and then permeate three vectors: "source, target and corr" with these values to eventually make a nicely organised data frame.

source=c()
target=c()
corr<-c()
g2<-g1
for(gene in g1){
for(gen in g2){
if(m[gene,gen]>0.001){
source<-c(source,gene)
target<-c(target,gen)
corr<-c(corr,m[gene,gen])
}
}
}
network<-data.frame(source,target,corr)

However for big data frames this code is quite slow and I switched to vectorised format. And yet when I try it in vectorised format its output is different...

network2 = which(m > 0.001, arr.ind = TRUE)
source = network2[,1]
target  = network2[,2]
corr = m[m > 0.001]
network2<-cbind(network2, corr)
colnames(network2)<-c("source", "target", "corr")

My problem is that when I run these two pieces of code on the same vector

>network
source target        corr
1      n1     n1 1.000000000
2      n1     n3 0.001466542
3      n1     n7 0.001164927
4      n1     n9 0.004027818
5      n2     n2 1.000000000
6      n2     n9 0.002580978

The above is the expected output from the for loop code:

However for big data frames this code is quite slow and I switched to vectorised format. And yet when I try it in vectorised format its output is different...

network2 = which(m > 0.001)
source = network2[,1]
target  = network2[,2]
corr = m[m > 0.001]
network2<-cbind(network2, corr)
colnames(network2)<-c("source", "target", "corr")

The problem is that I get the following error:

matching_indices = which(m > 0.01)
source = matching_indices[,1]

Error in matching_indices[, 1] : incorrect number of dimensions

This is because I have not used the argument 'are.ind=TRUE' in the 'match' command. But specifying this gives undesired (as it takes only the self-interacting nodes into account) output... so I'm wondering what I should do about this?

The undesired output is as follows (notice that 'corr' is unanimously 1, only taking self replicating nodes into account):

> network2
source target corr
n1       1      1    1
n2       2      2    1
n3       3      3    1
n4       4      4    1
n5       5      5    1
n6       6      6    1

How do I get the output of the vectorised format code to be the same as the the for loop kind?

example data can be stored in the variable 'dat' with the following code:

library('dplyr')
library('igraph')
library('RColorBrewer')

set.seed(1)

# generate a couple clusters
nodes_per_cluster <- 30
n <- 10

nvals <- nodes_per_cluster * n

# cluster 1 (increasing)
cluster1 <- matrix(rep((1:n)/4, nodes_per_cluster) +
rnorm(nvals, sd=1),
nrow=nodes_per_cluster, byrow=TRUE)

# cluster 2 (decreasing)
cluster2 <- matrix(rep((n:1)/4, nodes_per_cluster) +
rnorm(nvals, sd=1),
nrow=nodes_per_cluster, byrow=TRUE)

# noise cluster
noise <- matrix(sample(1:2, nvals, replace=TRUE) +
rnorm(nvals, sd=1.5),
nrow=nodes_per_cluster, byrow=TRUE)

dat <- rbind(cluster1, cluster2, noise)
colnames(dat) <- paste0('n', 1:n)
rownames(dat) <- c(paste0('cluster1_', 1:nodes_per_cluster),
paste0('cluster2_', 1:nodes_per_cluster),
paste0('noise_',    1:nodes_per_cluster))
Answer: most efficient way to read adjacency matrix to make networkd3 graph
1
Martin Morgan ♦♦ 23k wrote:

Here's some data

set.seed(123)
m = cor(matrix(rnorm(100), 10, dimnames=list(LETTERS[1:10], LETTERS[1:10])))

Something like

idx = m > 0.001
df = data.frame(
source = row(m)[idx],
target = col(m)[idx],
corr = m[idx])

Code that uses a for loop, and 'copy-and-append' where a result grows inside the loop, should be viewed very suspiciously -- it makes n * (n-1) / 2 copies as the the result grows, so scales quadratically with the size of the problem.

The first few rows of the data frame are

source target       corr
1      1      1 1.00000000
2      2      1 0.57761512
3      4      1 0.67301956
4      6      1 0.09055842
5      8      1 0.65713875
6      1      2 0.57761512

A variant of this would use the row / column names rather than indexes

df = data.frame(
source = rownames(m)[ row(m)[idx] ],
target = colnames(m)[ col(m)[idx] ],
corr = m[idx])

This result is identical to the for loop, but with rows in a different order. This is shown by the following code, where I have run the for loop using adjacency = m

> df1 = df[order(df$source, df$target),]; rownames(df1) = NULL
> network1 = network[order(network$source, network$target),]; rownames(network1) = NULL
> identical(df1, network1)
 TRUE

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Martin Morgan ♦♦ 23k

Hi there thanks for the reply, while your code gets rid of the problem with needing to specify "arr.ind=TRUE" it produces the same output as the unwanted df given above- where it only seems to take self -interacting nodes into account.... do you reckon there is something wrong with the 'for loop' code that I give? I got it from an example tutorial and its output seems to indicate that it is attempting something different to that of the vectorised code... I am trying to take the edge values that are higher than 0.01 and make a network out of them....

I added the first few rows of output to my answer. The results include all nodes satisfying the criterion, not just self-interacting. The order of results is different from the for loop; is this what is causing you problems? I also added a variant where the row and column names are reported, rather than the indexes. Or put another way, what do you mean by 'self-interacting'?

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Martin Morgan ♦♦ 23k

Hi yes I have checked with my own code and unfortunately, while the output is improved as the self-interactive nodes are no longer a problem (i.e. the nodes have correlation values for nodes besides themselves), the output is till different to that of the 'for loop' code- self-interating means that the nodes interact  with themselves by being given a correlation value of 1 diagonally through the matrix you can get rid of this with 'diag(adj_matrix)<-0'

I updated my response to show that the results of the for loop and my approach (also, your approach using which(m > 0.001, arr.ind=TRUE)) are identical except the rows of the data frames are in different order.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Martin Morgan ♦♦ 23k

Thanks so much it seems to work like you said

Answer: most efficient way to read adjacency matrix to make networkd3 graph
0
phil.chapman140 wrote:

It's quite difficult to follow your code without being able to run it.  As I understand it you're starting from a matrix, so perhaps you could make a very small, toy object that you could include in your code to get us started?  The following approach may be useful as it generates an R command at the end which can be copied and pasted into a script and run to construct the (in this case) toy object:

> data(mtcars)
> m <- as.matrix(mtcars)
> toy <- m[1:5,1:5]
> toy
mpg cyl disp  hp drat
Mazda RX4         21.0   6  160 110 3.90
Mazda RX4 Wag     21.0   6  160 110 3.90
Datsun 710        22.8   4  108  93 3.85
Hornet 4 Drive    21.4   6  258 110 3.08
Hornet Sportabout 18.7   8  360 175 3.15
> dput(toy)
structure(c(21, 21, 22.8, 21.4, 18.7, 6, 6, 4, 6, 8, 160, 160,
108, 258, 360, 110, 110, 93, 110, 175, 3.9, 3.9, 3.85, 3.08,
3.15), .Dim = c(5L, 5L), .Dimnames = list(c("Mazda RX4", "Mazda RX4 Wag",
"Datsun 710", "Hornet 4 Drive", "Hornet Sportabout"), c("mpg",
"cyl", "disp", "hp", "drat")))

hi there, thanks for the reply does this adj matrix work for you?

6450255     2570615      6370619       2600039     2650615      5340672       2000519     3870044
6450255  1.0000000  0.27049207 -0.189722700  0.4304964296  0.26144185  0.203182365  0.5261722409  0.28750105
2570615  0.2704921  1.00000000  0.042253692  0.3269207024  0.26079551  0.172129526 -0.1091231736  0.13630307
6370619 -0.1897227  0.04225369  1.000000000 -0.3046903972  0.15350119 -0.002081143  0.0154829510 -0.28944753
2600039  0.4304964  0.32692070 -0.304690397  1.0000000000 -0.05116379 -0.265505477 -0.0003627644  0.18117889
2650615  0.2614419  0.26079551  0.153501188 -0.0511637926  1.00000000  0.070419824  0.1920755290  0.39206708
5340672  0.2031824  0.17212953 -0.002081143 -0.2655054768  0.07041982  1.000000000  0.0118562293  0.23229705
2000519  0.5261722 -0.10912317  0.015482951 -0.0003627644  0.19207553  0.011856229  1.0000000000  0.01503415
3870044  0.2875011  0.13630307 -0.289447531  0.1811788864  0.39206708  0.232297053  0.0150341497  1.00000000
7050209  0.1073443 -0.11619209 -0.043299033  0.0086922869  0.13417166 -0.002184810  0.5265883416  0.11916228
1580181 -0.4386352 -0.20304406  0.441508770 -0.3195269908 -0.17324532 -0.086536659 -0.0233773362 -0.34569040
7050209     1580181
6450255  0.107344307 -0.43863522
2570615 -0.116192095 -0.20304406
6370619 -0.043299033  0.44150877
2600039  0.008692287 -0.31952699
2650615  0.134171658 -0.17324532
5340672 -0.002184810 -0.08653666
2000519  0.526588342 -0.02337734
3870044  0.119162282 -0.34569040
7050209  1.000000000  0.05452883
1580181  0.054528831  1.00000000

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by chrisclarkson10030