Search
Question: most efficient way to read adjacency matrix to make networkd3 graph
0
2.3 years ago by
chrisclarkson10030 wrote:

I am inexperienced with iteration in R and am hoping to speed up a process as I am implementing some analysis in a website. I realise this may be a Q for Stackoverflow but I couldn't get an answer from them..

I found a very useful tutorial that allows me to iterate through an adjacency matrix (used to make a data frame as input for networkd3), pick out the criteria data above a certain threshold (>0.01) and then permeate three vectors: "source, target and corr" with these values to eventually make a nicely organised data frame.

adjacency<-adjacency(dat)
source=c()
target=c()
corr<-c()
g2<-g1
for(gene in g1){
for(gen in g2){
if(m[gene,gen]>0.001){
source<-c(source,gene)
target<-c(target,gen)
corr<-c(corr,m[gene,gen])
}
}
}
network<-data.frame(source,target,corr)

However for big data frames this code is quite slow and I switched to vectorised format. And yet when I try it in vectorised format its output is different...

network2 = which(m > 0.001, arr.ind = TRUE)
source = network2[,1]
target  = network2[,2]
corr = m[m > 0.001]
network2<-cbind(network2, corr)
colnames(network2)<-c("source", "target", "corr")

My problem is that when I run these two pieces of code on the same vector

>network
source target        corr
1      n1     n1 1.000000000
2      n1     n3 0.001466542
3      n1     n7 0.001164927
4      n1     n9 0.004027818
5      n2     n2 1.000000000
6      n2     n9 0.002580978

The above is the expected output from the for loop code:

However for big data frames this code is quite slow and I switched to vectorised format. And yet when I try it in vectorised format its output is different...

network2 = which(m > 0.001)
source = network2[,1]
target  = network2[,2]
corr = m[m > 0.001]
network2<-cbind(network2, corr)
colnames(network2)<-c("source", "target", "corr")

The problem is that I get the following error:

matching_indices = which(m > 0.01)
source = matching_indices[,1]

Error in matching_indices[, 1] : incorrect number of dimensions

This is because I have not used the argument 'are.ind=TRUE' in the 'match' command. But specifying this gives undesired (as it takes only the self-interacting nodes into account) output... so I'm wondering what I should do about this?

The undesired output is as follows (notice that 'corr' is unanimously 1, only taking self replicating nodes into account):

> network2
source target corr
n1       1      1    1
n2       2      2    1
n3       3      3    1
n4       4      4    1
n5       5      5    1
n6       6      6    1

How do I get the output of the vectorised format code to be the same as the the for loop kind?

example data can be stored in the variable 'dat' with the following code:

library('dplyr')
library('igraph')
library('RColorBrewer')

set.seed(1)

# generate a couple clusters
nodes_per_cluster <- 30
n <- 10

nvals <- nodes_per_cluster * n

# cluster 1 (increasing)
cluster1 <- matrix(rep((1:n)/4, nodes_per_cluster) +
rnorm(nvals, sd=1),
nrow=nodes_per_cluster, byrow=TRUE)

# cluster 2 (decreasing)
cluster2 <- matrix(rep((n:1)/4, nodes_per_cluster) +
rnorm(nvals, sd=1),
nrow=nodes_per_cluster, byrow=TRUE)

# noise cluster
noise <- matrix(sample(1:2, nvals, replace=TRUE) +
rnorm(nvals, sd=1.5),
nrow=nodes_per_cluster, byrow=TRUE)

dat <- rbind(cluster1, cluster2, noise)
colnames(dat) <- paste0('n', 1:n)
rownames(dat) <- c(paste0('cluster1_', 1:nodes_per_cluster),
paste0('cluster2_', 1:nodes_per_cluster),
paste0('noise_',    1:nodes_per_cluster))
modified 2.3 years ago by phil.chapman140 • written 2.3 years ago by chrisclarkson10030
1
2.3 years ago by
Martin Morgan ♦♦ 22k
United States
Martin Morgan ♦♦ 22k wrote:

Here's some data

set.seed(123)
m = cor(matrix(rnorm(100), 10, dimnames=list(LETTERS[1:10], LETTERS[1:10])))

Something like

idx = m > 0.001
df = data.frame(
source = row(m)[idx],
target = col(m)[idx],
corr = m[idx])

Code that uses a for loop, and 'copy-and-append' where a result grows inside the loop, should be viewed very suspiciously -- it makes n * (n-1) / 2 copies as the the result grows, so scales quadratically with the size of the problem.

The first few rows of the data frame are

> head(df)
source target       corr
1      1      1 1.00000000
2      2      1 0.57761512
3      4      1 0.67301956
4      6      1 0.09055842
5      8      1 0.65713875
6      1      2 0.57761512

A variant of this would use the row / column names rather than indexes

df = data.frame(
source = rownames(m)[ row(m)[idx] ],
target = colnames(m)[ col(m)[idx] ],
corr = m[idx])

This result is identical to the for loop, but with rows in a different order. This is shown by the following code, where I have run the for loop using adjacency = m

> df1 = df[order(df$source, df$target),]; rownames(df1) = NULL
> network1 = network[order(network$source, network$target),]; rownames(network1) = NULL
> identical(df1, network1)
[1] TRUE

Hi there thanks for the reply, while your code gets rid of the problem with needing to specify "arr.ind=TRUE" it produces the same output as the unwanted df given above- where it only seems to take self -interacting nodes into account.... do you reckon there is something wrong with the 'for loop' code that I give? I got it from an example tutorial and its output seems to indicate that it is attempting something different to that of the vectorised code... I am trying to take the edge values that are higher than 0.01 and make a network out of them....

I added the first few rows of output to my answer. The results include all nodes satisfying the criterion, not just self-interacting. The order of results is different from the for loop; is this what is causing you problems? I also added a variant where the row and column names are reported, rather than the indexes. Or put another way, what do you mean by 'self-interacting'?

Hi yes I have checked with my own code and unfortunately, while the output is improved as the self-interactive nodes are no longer a problem (i.e. the nodes have correlation values for nodes besides themselves), the output is till different to that of the 'for loop' code- self-interating means that the nodes interact  with themselves by being given a correlation value of 1 diagonally through the matrix you can get rid of this with 'diag(adj_matrix)<-0'

I updated my response to show that the results of the for loop and my approach (also, your approach using which(m > 0.001, arr.ind=TRUE)) are identical except the rows of the data frames are in different order.

Thanks so much it seems to work like you said

0
2.3 years ago by
phil.chapman140
United Kingdom
phil.chapman140 wrote:

It's quite difficult to follow your code without being able to run it.  As I understand it you're starting from a matrix, so perhaps you could make a very small, toy object that you could include in your code to get us started?  The following approach may be useful as it generates an R command at the end which can be copied and pasted into a script and run to construct the (in this case) toy object:

> data(mtcars)
> m <- as.matrix(mtcars)
> toy <- m[1:5,1:5]
> toy
mpg cyl disp  hp drat
Mazda RX4         21.0   6  160 110 3.90
Mazda RX4 Wag     21.0   6  160 110 3.90
Datsun 710        22.8   4  108  93 3.85
Hornet 4 Drive    21.4   6  258 110 3.08
Hornet Sportabout 18.7   8  360 175 3.15
> dput(toy)
structure(c(21, 21, 22.8, 21.4, 18.7, 6, 6, 4, 6, 8, 160, 160,
108, 258, 360, 110, 110, 93, 110, 175, 3.9, 3.9, 3.85, 3.08,
3.15), .Dim = c(5L, 5L), .Dimnames = list(c("Mazda RX4", "Mazda RX4 Wag",
"Datsun 710", "Hornet 4 Drive", "Hornet Sportabout"), c("mpg",
"cyl", "disp", "hp", "drat")))

hi there, thanks for the reply does this adj matrix work for you?

           6450255     2570615      6370619       2600039     2650615      5340672       2000519     3870044 6450255  1.0000000  0.27049207 -0.189722700  0.4304964296  0.26144185  0.203182365  0.5261722409  0.28750105 2570615  0.2704921  1.00000000  0.042253692  0.3269207024  0.26079551  0.172129526 -0.1091231736  0.13630307 6370619 -0.1897227  0.04225369  1.000000000 -0.3046903972  0.15350119 -0.002081143  0.0154829510 -0.28944753 2600039  0.4304964  0.32692070 -0.304690397  1.0000000000 -0.05116379 -0.265505477 -0.0003627644  0.18117889 2650615  0.2614419  0.26079551  0.153501188 -0.0511637926  1.00000000  0.070419824  0.1920755290  0.39206708 5340672  0.2031824  0.17212953 -0.002081143 -0.2655054768  0.07041982  1.000000000  0.0118562293  0.23229705 2000519  0.5261722 -0.10912317  0.015482951 -0.0003627644  0.19207553  0.011856229  1.0000000000  0.01503415 3870044  0.2875011  0.13630307 -0.289447531  0.1811788864  0.39206708  0.232297053  0.0150341497  1.00000000 7050209  0.1073443 -0.11619209 -0.043299033  0.0086922869  0.13417166 -0.002184810  0.5265883416  0.11916228 1580181 -0.4386352 -0.20304406  0.441508770 -0.3195269908 -0.17324532 -0.086536659 -0.0233773362 -0.34569040              7050209     1580181 6450255  0.107344307 -0.43863522 2570615 -0.116192095 -0.20304406 6370619 -0.043299033  0.44150877 2600039  0.008692287 -0.31952699 2650615  0.134171658 -0.17324532 5340672 -0.002184810 -0.08653666 2000519  0.526588342 -0.02337734 3870044  0.119162282 -0.34569040 7050209  1.000000000  0.05452883 1580181  0.054528831  1.00000000