Question

slow insertions in to graphNEL object (24 hours for 16k nodes)

0

Entering edit mode

Paul Shannon ★ 1.1k

@paul-shannon-578

Last seen 9.6 years ago

It took nearly 24 hours (!) to create a 16k node graph using two different techniques: g = fromGXL (file ('someFile.gxl')) and g = new ('graphNEL', edgemode='undirected') edgeDataDefaults (g, attr='edgeType') = 'edge' edgeDataDefaults (g, attr='source') = 'unknown' ... for (r in 1:max) { ... g = addNode (a, g) g = addNode (b, g) g = addEdge (a, b, g) edgeData (g, a, b, 'source') = source edgeData (g, a, b, 'edgeType') = method } The 16k nodes and their edges are from a suitably parsed version of all of the reactions reported by KEGG. Is this user error, user misconception, ... or maybe an inefficiency that future versions of the graph package could improve upon? sessionInfo () R version 2.5.1 (2007-06-27) i386-apple-darwin8.9.1 locale: C attached base packages: [1] "stats" "graphics" "grDevices" "utils" "datasets" "methods" "base" other attached packages: graph RUnit "1.14.0" "0.4.15"

graph graph • 1.4k views

ADD COMMENT • link updated 16.6 years ago by Seth Falcon ★ 7.4k • written 16.6 years ago by Paul Shannon ★ 1.1k

score 0 · Answer 1 · 2007-09-11

0

Entering edit mode

Seth Falcon ★ 7.4k

@seth-falcon-992

Last seen 9.6 years ago

Hi Paul, Paul Shannon <pshannon at="" systemsbiology.org=""> writes: > It took nearly 24 hours (!) to create a 16k node graph using two > different techniques: > > g = fromGXL (file ('someFile.gxl')) > > and > > g = new ('graphNEL', edgemode='undirected') > edgeDataDefaults (g, attr='edgeType') = 'edge' > edgeDataDefaults (g, attr='source') = 'unknown' > > ... > > for (r in 1:max) { > ... > g = addNode (a, g) > g = addNode (b, g) > g = addEdge (a, b, g) > edgeData (g, a, b, 'source') = source > edgeData (g, a, b, 'edgeType') = method > } > > The 16k nodes and their edges are from a suitably parsed version of > all of the reactions > reported by KEGG. > > Is this user error, user misconception, ... or maybe an inefficiency > that future versions > of the graph package could improve upon? It looks like fromGXL is doing something quite similar to the for loop you describe above. As you have demonstrated, this is not the most efficient way to construct graph objects. The immediate reason is that each call to addNode, addEdge, and edgeData creates a new copy of the _entire_ graph. We will look into this and see if we can provide some relief for fromGXL. Thanks for the report. + seth -- Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center BioC: http://bioconductor.org/ Blog: http://userprimary.net/user/

ADD COMMENT • link 16.6 years ago Seth Falcon ★ 7.4k

0

Entering edit mode

Hi Seth, If you are digging around in the innards of the graph package, I have another suggestion -- an imperfect one -- to suggest. I sometimes use biomaRt and graph in the same project. biomaRt requires XML which has, like graph, a method called 'addMode': library (biomaRt) Loading required package: XML Attaching package: 'XML' The following object(s) are masked from package:graph : addNode I can work around this just fine by calling graph::addNode, but maybe an alias could be adopted as well, and then favored over the long term -- 'add.node' or some such thing. Or maybe this isn't worth bothering with. - Paul > Hi Paul, > > Paul Shannon <pshannon at="" systemsbiology.org=""> writes: >> It took nearly 24 hours (!) to create a 16k node graph using two >> different techniques: >> >> g = fromGXL (file ('someFile.gxl')) >> >> and >> >> g = new ('graphNEL', edgemode='undirected') >> edgeDataDefaults (g, attr='edgeType') = 'edge' >> edgeDataDefaults (g, attr='source') = 'unknown' >> >> ... >> >> for (r in 1:max) { >> ... >> g = addNode (a, g) >> g = addNode (b, g) >> g = addEdge (a, b, g) >> edgeData (g, a, b, 'source') = source >> edgeData (g, a, b, 'edgeType') = method >> } >> >> The 16k nodes and their edges are from a suitably parsed version of >> all of the reactions >> reported by KEGG. >> >> Is this user error, user misconception, ... or maybe an inefficiency >> that future versions >> of the graph package could improve upon? > > It looks like fromGXL is doing something quite similar to the for loop > you describe above. As you have demonstrated, this is not the most > efficient way to construct graph objects. The immediate reason is > that each call to addNode, addEdge, and edgeData creates a new copy of > the _entire_ graph. > > We will look into this and see if we can provide some relief for > fromGXL. Thanks for the report. > > + seth > > -- > Seth Falcon | Computational Biology | Fred Hutchinson Cancer > Research Center > BioC: http://bioconductor.org/ > Blog: http://userprimary.net/user/

ADD REPLY • link 16.6 years ago Paul Shannon ★ 1.1k

score 0 · Answer 2 · 2007-09-12

0

Entering edit mode

Seth Falcon ★ 7.4k

@seth-falcon-992

Last seen 9.6 years ago

Hi again, Seth Falcon <sfalcon at="" fhcrc.org=""> writes: > Paul Shannon <pshannon at="" systemsbiology.org=""> writes: >> It took nearly 24 hours (!) to create a 16k node graph using two >> different techniques: >> >> g = fromGXL (file ('someFile.gxl')) Using a patched version of fromGXL (on my laptop) I get: > library(graph) > con = file("kegg.yeast.gxl", open="r") > system.time(z <- fromGXL(con)) user system elapsed 104.366 0.570 105.070 > z A graphNEL graph with undirected edges Number of Nodes = 15158 Number of Edges = 32668 > validObject(z) [1] TRUE That's over 800x faster :-) I've checked in the changes in devel as graph 1.15.17. You can get it from svn or wait till this time tomorrow (in either case, you will need R 2.6.0 alpha). The code passes the existing unit tests, but some extra scrutiny that the returned graphs are as desired is quite welcome. + seth -- Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center BioC: http://bioconductor.org/ Blog: http://userprimary.net/user/

ADD COMMENT • link 16.6 years ago Seth Falcon ★ 7.4k

0

Entering edit mode

A couple of weeks ago, just before he left, Seth made a striking (800x) improvement to the speed of graph::fromGXL. This has been a big help. However, I'd like to construct graphs incrementally, and programmatically, without having to create new gxl files. Is it possible that Seth's efficiencies have not yet been applied to other methods in the graph class? In particular, it seems that the following operations take a longish time on graphs of a few thousand nodes: edgeData nodeData addEdge addNode I could provide actual times and test cases if that would help. Thanks! - Paul On Sep 12, 2007, at 8:09 AM, Seth Falcon wrote: > Hi again, > > Seth Falcon <sfalcon at="" fhcrc.org=""> writes: >> Paul Shannon <pshannon at="" systemsbiology.org=""> writes: >>> It took nearly 24 hours (!) to create a 16k node graph using two >>> different techniques: >>> >>> g = fromGXL (file ('someFile.gxl')) > > Using a patched version of fromGXL (on my laptop) I get: > >> library(graph) >> con = file("kegg.yeast.gxl", open="r") >> system.time(z <- fromGXL(con)) > user system elapsed > 104.366 0.570 105.070 >> z > A graphNEL graph with undirected edges > Number of Nodes = 15158 > Number of Edges = 32668 >> validObject(z) > [1] TRUE > > That's over 800x faster :-) > > I've checked in the changes in devel as graph 1.15.17. You can get it > from svn or wait till this time tomorrow (in either case, you will > need R 2.6.0 alpha). > > The code passes the existing unit tests, but some extra scrutiny that > the returned graphs are as desired is quite welcome. > > + seth > > -- > Seth Falcon | Computational Biology | Fred Hutchinson Cancer > Research Center > BioC: http://bioconductor.org/ > Blog: http://userprimary.net/user/

ADD REPLY • link 16.6 years ago Paul Shannon ★ 1.1k

0

Entering edit mode

Dear Paul, there is almost surely scope for making performance increases in any of these functions (and patches are welcome), but it is also worth noting that the fact that R is a functional language with mostly a "pass-by-value" semantics will often put limits to this. A syntax like nodeData(g, "node1", "flavour") = "strawberry" really evaluates to g = "nodeData<-"(g, "node1", "flavour", "strawberry") and a full copy of g is made and eventually destroyed in memory each time you call this. Similarly for addNode etc. Until someone writes more performant graph classes (note that "graphNEL", which you are probably using, is only one of the several subclasses of "graph"), a pragmatic workaround is: 1. If you know the node set in advance, your edges do not have complex attributes, and you just want to incrementally add and remove edges, you could use the base-R matrix class to store the graph's adjacency matrix, and only in the end convert it to a "graph" object (see the man page "? ftM2graphNEL"). I believe some of the native data types in R such as numeric vectors and matrices have much more efficient replacement methods "[<-" than what I said above for "nodeData<-" - they are able to change an object in situ. 2. Alternatively, you could use an environment to store objects describing your nodes and edges. Adding and removing things from an environment is very efficient / cheap. Again, once you are done, you could convert into the cleaner and more robst S4 graph object. Best wishes Wolfgang ------------------------------------------------------------------ Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber Shannon ha scritto: > A couple of weeks ago, just before he left, Seth made a striking (800x) > improvement to the speed of graph::fromGXL. This has been a big help. > > However, I'd like to construct graphs incrementally, and > programmatically, > without having to create new gxl files. Is it possible that > Seth's efficiencies have not yet been applied to other methods in the > graph class? > > In particular, it seems that the following operations take a longish > time > on graphs of a few thousand nodes: > > edgeData > nodeData > addEdge > addNode > > I could provide actual times and test cases if that would help. > > Thanks! > > - Paul > > > > On Sep 12, 2007, at 8:09 AM, Seth Falcon wrote: > >> Hi again, >> >> Seth Falcon <sfalcon at="" fhcrc.org=""> writes: >>> Paul Shannon <pshannon at="" systemsbiology.org=""> writes: >>>> It took nearly 24 hours (!) to create a 16k node graph using two >>>> different techniques: >>>> >>>> g = fromGXL (file ('someFile.gxl')) >> Using a patched version of fromGXL (on my laptop) I get: >> >>> library(graph) >>> con = file("kegg.yeast.gxl", open="r") >>> system.time(z <- fromGXL(con)) >> user system elapsed >> 104.366 0.570 105.070 >>> z >> A graphNEL graph with undirected edges >> Number of Nodes = 15158 >> Number of Edges = 32668 >>> validObject(z) >> [1] TRUE >> >> That's over 800x faster :-) >> >> I've checked in the changes in devel as graph 1.15.17. You can get it >> from svn or wait till this time tomorrow (in either case, you will >> need R 2.6.0 alpha). >> >> The code passes the existing unit tests, but some extra scrutiny that >> the returned graphs are as desired is quite welcome. >> >> + seth >> >> -- >> Seth Falcon | Computational Biology | Fred Hutchinson Cancer >> Research Center >> BioC: http://bioconductor.org/ >> Blog: http://userprimary.net/user/ >

ADD REPLY • link 16.6 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

Hi Wolfgang, Thanks for your explanations. I pack quite a lot of information into node and (especially) edge attributes, so your first suggestion, ftM2graphNEL, does not look so promising. Your second suggestion sounds like it might work. Is there documentation somewhere that shows how to convert an environment into an S4 graph object? And perhaps the complementary conversion, from a graph into an environment? Cheers, - Paul > 2. Alternatively, you could use an environment to store objects > describing > your nodes and edges. Adding and removing things from an > environment is > very efficient / cheap. Again, once you are done, you could convert > into the > cleaner and more robst S4 graph object.

ADD REPLY • link 16.6 years ago Paul Shannon ★ 1.1k

0

Entering edit mode

Dear Paul, > Your second suggestion sounds like it might work. Is there documentation > somewhere that shows how to convert an environment into an S4 graph object? > And perhaps the complementary conversion, from a graph into an environment? Seth's and Martin' answers pretty much cover what I was thinking. Beyond Seth's, I know of no other existing solutions. (I.e. you could consider this an opportunity to write and contribute your own 'efficient' graph class based on the virtual "graph" class in the "graph" package, and with similar functionality as "graphNEL" but different internal implementation. The nice thing about Bioconductors "graph" class is that it allows to do that. I and others have dabbled in using one of the sparse matrix implementations for R for storing the adjacency matrix - but none of these attempts are mature, afaIk (?). Best wishes Wolfgang ------------------------------------------------------------------ Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber

ADD REPLY • link 16.6 years ago Wolfgang Huber ★ 13k