rBiopaxParser, Reactome and namespaces


Paul Shannon ▴ 470

@paul-shannon-5944


United States

Hi Frank,

I am most happily using the rBiopaxParser package, and your vignette, to extract detailed (but topologically simple) interaction data from the latest Reactome "Homosapiens.owl". Your package offers great power and convenience.

However, I run into difficulty with namespaces.

For a simple example, consider this one line from the method listInstances, found in the file R/selectBiopax.R:

    sel = sel & (tolower(biopax$df$class) %in% tolower(stripns(class)))

As parsed from Homosapiens.owl, the class column of biopax$df has values like these, always containing a namespace prefix:

    head(unique(biopax$df$class))
    "bp:BiochemicalReaction"        "bp:Protein"
    "bp:CellularLocationVocabulary" "bp:UnificationXref"
    "bp:ProteinReference"           "bp:BioSource"

Because the namespace is stripped off "bp:Protein" (the right-hand side of the %in% clause), it cannot match the biopax$df$class value as parsed from the owl file (which preserves the "bp:").

I believe I see similar logic in other places, with these methods specifically encountered so far:

    selectInstances
    listPathwayComponents

Namespaces are used with the "property" column as well:

    head(table(biopax$df$property), n=3)
              bp:author bp:cellularLocation          bp:comment
                  55654               23838              123750

Speaking from the nickel seats, and not claiming to understand all of the implications: perhaps these problems could be neatly avoided if your readBiopax method could optionally eliminate namespaces when reading in an owl file?

Thanks,

- Paul
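The mismatch described in the post can be reproduced on toy data without the package. The sketch below is only an illustration: strip_prefix is a hypothetical stand-in for the package's stripns (whose real implementation may differ), and the class values are copied from the post rather than read from an OWL file.

```r
# Hypothetical stand-in for stripns(); the real helper may behave differently.
strip_prefix <- function(x) sub("^[^:]+:", "", x)

# Toy class column that kept its "bp:" prefix, as reported in the post.
parsed_class <- c("bp:BiochemicalReaction", "bp:Protein", "bp:BioSource")
query <- "bp:Protein"

# Prefix stripped from the query only: no element of the column can match.
one_sided <- tolower(parsed_class) %in% tolower(strip_prefix(query))

# Prefix stripped from both sides: the Protein entry matches.
two_sided <- tolower(strip_prefix(parsed_class)) %in% tolower(strip_prefix(query))

print(one_sided)
print(two_sided)
```

The comparison only works when both sides are in the same form, which is why the selection functions depend on the parser having removed the prefixes already.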


Written 10.9 years ago by Paul Shannon ▴ 470 • updated 10.9 years ago by Frank Kramer ▴ 60


Frank Kramer ▴ 60

@frank-kramer-5951


Germany

Dear Paul,

thank you for the report. I absolutely agree with you: the namespaces of an OWL (and XML/RDF) file are not fixed and can vary between pathway database providers. Namespace identifiers are, or at least should be, removed from instances during parsing. As you noticed, I strip namespaces off the input parameters, both because you should not be able to find anything if you include them and to add a bit of robustness. It seems this did not work very well in your case ;-)

Unfortunately I could not reproduce your problem:

####### CODE
library(rBiopaxParser)
# Reactome URLs changed, so what used to link to biopax2 is now biopax3.
# This is for shortness of example code; I also tried this with manually
# downloading the OWL files.
file = downloadBiopaxData(database="reactome", model="reactome", version="biopax2")
biopax = readBiopax(file, verbose=TRUE)
head(biopax$df)

####### OUTPUT
Found a BioPAX level 3 OWL. Parsing...
[Info Verbose] Parsing Biopax-Model as a data.frame...
(...)
[Info Verbose] Finished! Created a data.frame with 1000689 rows within only 3591.365 seconds.

> head(biopax$df)
                class                   id property property_attr
1 BiochemicalReaction BiochemicalReaction1     left  rdf:resource
2 BiochemicalReaction BiochemicalReaction1     left  rdf:resource
3 BiochemicalReaction BiochemicalReaction1     left  rdf:resource
4 BiochemicalReaction BiochemicalReaction1    right  rdf:resource
5 BiochemicalReaction BiochemicalReaction1    right  rdf:resource
6 BiochemicalReaction BiochemicalReaction1 eCNumber  rdf:datatype
                      property_attr_value property_value
1                               #Complex1
2                               #Complex2
3                              #Protein12
4                         #SmallMolecule1
5                               #Complex3
6 http://www.w3.org/2001/XMLSchema#string       3.1.3.48

> head(unique(biopax$df$class))
[1] BiochemicalReaction        Complex
[3] CellularLocationVocabulary UnificationXref
[5] Protein                    ProteinReference
33 Levels: BiochemicalReaction BioSource ... UnificationXref

> head(table(biopax$df$property), n=3)
          author cellularLocation          comment
           65131            24758           130936

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] RCurl_1.95-4.1 bitops_1.0-5 rBiopaxParser_1.0.0

loaded via a namespace (and not attached):
[1] XML_3.96-1.1
####### END

Can you check biopax$namespaces to see if any namespaces were detected during parsing? These are saved so that they can be reused if you want to write out a new BioPAX OWL file later on.

Can you also check whether you are using the current release/devel version of rBiopaxParser?

Best wishes,
Frank

University Medical Center Göttingen
Department for Medical Statistics
Humboldtallee 32
37073 Göttingen
Germany

Phone: +49 (0) 551 39-10710
Fax: +49 (0) 551 39-4995

http://www.ams.med.uni-goettingen.de/amsneu/kramer-en.html
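A quick way to tell the two situations in this thread apart is to measure how many values in the class and property columns still carry a prefix. The following is a rough sketch on a made-up data frame that only mimics the column names of biopax$df shown above; has_prefix is a hypothetical helper, not part of rBiopaxParser.

```r
# Toy stand-in for biopax$df; the values are illustrative, not real parser output.
toy_df <- data.frame(
  class    = c("bp:Protein", "bp:Protein", "bp:BioSource"),
  property = c("bp:comment", "bp:name", "bp:name"),
  stringsAsFactors = FALSE
)

# TRUE where a value starts with a prefix such as "bp:".
# Note: this would also flag URLs ("http:"), so apply it to class/property only.
has_prefix <- function(x) grepl("^[A-Za-z][A-Za-z0-9_.-]*:", as.character(x))

# Share of prefixed values per column: 0 suggests the prefixes were removed
# during parsing, anything above 0 suggests they were kept.
prefixed_share <- c(
  class    = mean(has_prefix(toy_df$class)),
  property = mean(has_prefix(toy_df$property))
)
print(prefixed_share)
```

On a real model the same two mean() calls could be pointed at biopax$df$class and biopax$df$property, assuming the object has the layout shown in this thread.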

Frank Kramer ▴ 60 • 10.9 years ago


Hi Frank,

Thanks for your reply. I apologize for my delayed follow-up.

I think I can now give you a reproducible example, using Reactome's latest "Xenopus laevis" owl file, one of the smallest files they offer. I will mail you a gzipped version of this file off-list, just to be sure we both have the same test data.

I am using rBiopaxParser_1.1.1; full sessionInfo is shown below.

    library(rBiopaxParser)
    frog.bp <- readBiopax("Xenopus_laevis.owl")

    # if I let the "class" arg default to NULL, I get a data.frame with all instances, totaling 1572
    dim(selectInstances(frog.bp))   # 1572 6

    # subset that data.frame to get only the BiochemicalReactions
    dim(subset(selectInstances(frog.bp), class=="bp:BiochemicalReaction"))   # 91 6

    # try to replicate this by using the "class" arg to selectInstances.
    # both versions fail, with and without a namespace
    dim(selectInstances(frog.bp, class="bp:BiochemicalReaction"))   # 0 6
    dim(selectInstances(frog.bp, class="BiochemicalReaction"))      # 0 6

I won't exclude the possibility that I am doing something dumb!

- Paul

    > sessionInfo()
    R version 3.0.0 (2013-04-03)
    Platform: x86_64-apple-darwin10.8.0 (64-bit)

    locale:
    [1] C

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] rBiopaxParser_1.1.1 RUnit_0.4.26 BiocInstaller_1.11.1

    loaded via a namespace (and not attached):
    [1] XML_3.95-0.2 compiler_3.0.0 tools_3.0.0
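While the root cause is being tracked down, one conceivable stopgap is to strip the prefixes from the class and property columns after parsing, so that plain subsetting behaves the same on every machine. This is only a sketch under stated assumptions: normalize_columns and strip_prefix are hypothetical helpers, not rBiopaxParser functions, and the data frame is a toy that borrows the column names seen in this thread.

```r
# Hypothetical helper: drop everything up to and including the first colon.
strip_prefix <- function(x) sub("^[^:]+:", "", as.character(x))

# Hypothetical helper: normalize only the named columns, leave the rest alone
# (property_attr values such as "rdf:resource" are meant to keep their prefix).
normalize_columns <- function(df, cols = c("class", "property")) {
  for (col in intersect(cols, names(df))) {
    df[[col]] <- strip_prefix(df[[col]])
  }
  df
}

# Toy data in the shape of a parsed model that kept its "bp:" prefixes.
toy_df <- data.frame(
  class    = c("bp:BiochemicalReaction", "bp:Protein"),
  id       = c("BiochemicalReaction1", "Protein1"),
  property = c("bp:left", "bp:name"),
  stringsAsFactors = FALSE
)

clean_df  <- normalize_columns(toy_df)
reactions <- subset(clean_df, class == "BiochemicalReaction")
print(reactions)
```

Whether the package's own selection functions would accept a data frame patched this way is untested here; fixing the XML setup, as discussed later in the thread, is the cleaner route.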

Paul Shannon ▴ 470 • 10.9 years ago


Hi Paul,

I am quite sure we are dealing with a configuration issue here.

First I tried the code and OWL you supplied; however, I got exactly the opposite outcome. Namespaces were removed from the parsed data (using a fresh R 3.0.1 with rBiopaxParser 1.0 and 1.1.0) and the results were:

    dim(subset(selectInstances(frog.bp), class=="BiochemicalReaction"))      # 91 6
    dim(subset(selectInstances(frog.bp), class=="bp:BiochemicalReaction"))   # 0 6

    dim(selectInstances(frog.bp, class="bp:BiochemicalReaction"))   # 91 6
    dim(selectInstances(frog.bp, class="BiochemicalReaction"))      # 91 6

Can you re-run your example and post the output of:

----
library(XML)
xmlNamespaceDefinitions(xmlRoot(xmlInternalTreeParse("Xenopus_laevis.owl.gz")))
----

This should list all namespaces the XML package finds in this file (and they should get removed during parsing). For me the output is:

----
$rdf
$id
[1] "rdf"

$uri
[1] "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

$local
[1] TRUE

attr(,"class")
[1] "XMLNamespaceDefinition"

$bp
$id
[1] "bp"

$uri
[1] "http://www.biopax.org/release/biopax-level3.owl#"

$local
[1] TRUE

attr(,"class")
[1] "XMLNamespaceDefinition"

[... and some more]
----

I suspect this might be related to one of the following:

1. a configuration issue of libxml2 on your machine
2. a configuration issue of the XML package on your machine
3. a more general problem with the XML package under Mac OS

I am unfortunately not really familiar with Mac OS, but these would be my suggestions. Under Linux the command "xml2-config --version" prints the version of the libxml2 library (this is the underlying library; the R XML package is a wrapper for it). I am currently on version 2.8.0.

Can you make sure you have libxml2 installed?

Can you re-install the XML R package and pay special attention to possible configuration warnings?

Best wishes,
Frank
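The namespace check suggested in this reply can also be tried on a tiny inline document, which separates problems with the XML setup from anything specific to the Reactome file. The snippet below is a sketch: the inline RDF/XML is made up (only the two namespace URIs are taken from the output quoted above), and the check is skipped when the XML package is not installed.

```r
# Minimal made-up RDF/XML; the two namespace URIs are the ones quoted in the reply.
owl_text <- paste0(
  '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" ',
  'xmlns:bp="http://www.biopax.org/release/biopax-level3.owl#">',
  '<bp:Protein rdf:ID="Protein1"/>',
  '</rdf:RDF>'
)

ns_prefixes <- NULL
if (requireNamespace("XML", quietly = TRUE)) {
  # Parse from a string instead of a file, then list the root's namespace definitions.
  doc  <- XML::xmlInternalTreeParse(owl_text, asText = TRUE)
  defs <- XML::xmlNamespaceDefinitions(XML::xmlRoot(doc))
  ns_prefixes <- names(defs)
  print(ns_prefixes)
} else {
  message("XML package not installed; skipping the namespace check.")
}
```

If the prefixes declared on the root element do not show up here, the problem most likely sits in the XML package or libxml2 rather than in rBiopaxParser.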

Frank Kramer ▴ 60 • 10.9 years ago


Hi Frank,

You were right when you said:

> I am quite sure we are dealing with a configuration issue here.

I upgraded libxml2 to version 2.9.0 and XML (the R package) to 3.96-1.1. Apparently both updates were needed.

Thanks very much for helping me work through this.

Cheers!

- Paul
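For anyone hitting the same symptom, both version numbers mentioned in this resolution can be read from within R. This is a small sketch: xml_versions is a hypothetical helper name, and it returns NULL when the XML package is not installed.

```r
# Hypothetical helper: report the installed XML package version and the
# libxml2 version that the XML package reports.
xml_versions <- function() {
  if (!requireNamespace("XML", quietly = TRUE)) {
    return(NULL)
  }
  list(
    xml_package = as.character(utils::packageVersion("XML")),
    libxml2     = paste(unlist(XML::libxmlVersion()), collapse = ".")
  )
}

versions <- xml_versions()
print(versions)
```

Comparing these values with the ones reported in this thread (libxml2 2.9.0 and XML 3.96-1.1 after the upgrade) gives a first hint as to whether an outdated setup could be involved.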

Paul Shannon ▴ 470 • 10.9 years ago
