accessing GenBank features

0

Entering edit mode

Andrew Yee ▴ 350

@andrew-yee-2667

Last seen 11.4 years ago

Thinking about this more broadly, is there a Bioconductor package that lets you parse out the different features listed in a GenBank feature, somewhat akin to this: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation Thanks, Andrew On Wed, Sep 21, 2011 at 5:41 PM, Andrew Yee <yee at="" post.harvard.edu=""> wrote: > Thanks for the reply. ?I guess on a broader level, is there a way to > extract the sig_peptide field from > > http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3 > > I'm trying to figure out why the document reference in Carey's example > doesn't contain "sig_peptide" yet is visible on that web page. > > Perhaps there is another method of getting the annotation for > sig_peptide within GenBank? > > Thanks, > Andrew > > On Wed, Sep 21, 2011 at 4:07 PM, Vincent Carey > <stvjc at="" channing.harvard.edu=""> wrote: >> I don't see a sig_peptide field.? You should have a look at >> >> http://www.omegahat.org/RSXML/shortIntro.html >> >> and references therein. >> >> It has been a long time since I did anything with XML per se. We did a >> certain amount of exposition in Chapter 8 >> of the 2005 Springer monograph.? Since then more XPath support has come in >> and many new ideas help distance users from >> details of XML processing.? To illustrate a bit with your example, I trapped >> the actual document reference >> >> zz = >> xmlInternalTreeParse("http://www.ncbi.nih.gov/entrez/eutils/efetch. fcgi?tool=bioconductor&rettype=xml&retmode=text&db=Nucleotide&id=NM_00 0610") >> >> and then performed an XPath query >> >>> getNodeSet(zz, "//Seq-interval_from") >> [[1]] >> <seq-interval_from>3244</seq-interval_from> >> >> [[2]] >> <seq-interval_from>3328</seq-interval_from> >> >> [[3]] >> <seq-interval_from>5695</seq-interval_from> >> >> and so on.? I don't recall how to do a relatively simple task like >> "enumerate all tags in use in a document" but it can be done with the XML >> package tools.? I think it will be more effective to isolate the use case >> and see how to use eutils to solve it fairly directly as opposed to wading >> through XML, but perhaps wading is inevitable. >> >> >> On Wed, Sep 21, 2011 at 12:29 PM, Andrew Yee <yee at="" post.harvard.edu=""> wrote: >>> >>> Hi, I'm looking for some guidance in terms of parsing the XML output >>> from a genbank query. >>> >>> result <- genbank('NM_000610', disp='data', type='uid') >>> >>> I'm trying to figure out how to use the XML package in order to parse >>> out the "sig_peptide" field from the XML output from the genbank >>> query. >>> >>> Any pointers or suggestions would be appreciated, as I'm new to XML. >>> >>> Thanks, >>> Andrew >>> >>> >>> >>> > sessionInfo() >>> R version 2.13.0 (2011-04-13) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> ?[1] LC_CTYPE=en_US.UTF-8 ? ? ? LC_NUMERIC=C >>> ?[3] LC_TIME=en_US.UTF-8 ? ? ? ?LC_COLLATE=en_US.UTF-8 >>> ?[5] LC_MONETARY=C ? ? ? ? ? ? ?LC_MESSAGES=en_US.UTF-8 >>> ?[7] LC_PAPER=en_US.UTF-8 ? ? ? LC_NAME=C >>> ?[9] LC_ADDRESS=C ? ? ? ? ? ? ? LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base >>> >>> other attached packages: >>> [1] XML_3.2-0 ? ? ? ? ? ? annotate_1.29.4 ? ? ? AnnotationDbi_1.13.21 >>> [4] Biobase_2.11.10 >>> >>> loaded via a namespace (and not attached): >>> [1] DBI_0.2-5 ? ? RSQLite_0.9-4 tools_2.13.0 ?xtable_1.5-6 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >

Annotation Annotation • 2.8k views

ADD COMMENT • link updated 14.4 years ago by Martin Morgan 25k • written 14.4 years ago by Andrew Yee ▴ 350

0

Entering edit mode

Sean Davis 21k

@sean-davis-490

Last seen 4 weeks ago

United States

Hi, Andrew. This type of thing is done fairly robustly by bioperl and biopython, so you could consider those languages/projects. If you want to stick with R, consider using something like the genbank2gff3.pl script that ships with bioperl to convert genbank files into gff3 files that may lend themselves to a better fit with R. The restriction for genbank2gff3 is that the feature types are those supported in SOFA (sequence feature ontology). Sean On Wed, Sep 21, 2011 at 5:55 PM, Andrew Yee <yee at="" post.harvard.edu=""> wrote: > Thinking about this more broadly, is there a Bioconductor package that > lets you parse out the different features listed in a GenBank feature, > somewhat akin to this: > > http://www.bioperl.org/wiki/HOWTO:Feature-Annotation > > Thanks, > Andrew > > On Wed, Sep 21, 2011 at 5:41 PM, Andrew Yee <yee at="" post.harvard.edu=""> wrote: >> Thanks for the reply. ?I guess on a broader level, is there a way to >> extract the sig_peptide field from >> >> http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3 >> >> I'm trying to figure out why the document reference in Carey's example >> doesn't contain "sig_peptide" yet is visible on that web page. >> >> Perhaps there is another method of getting the annotation for >> sig_peptide within GenBank? >> >> Thanks, >> Andrew >> >> On Wed, Sep 21, 2011 at 4:07 PM, Vincent Carey >> <stvjc at="" channing.harvard.edu=""> wrote: >>> I don't see a sig_peptide field.? You should have a look at >>> >>> http://www.omegahat.org/RSXML/shortIntro.html >>> >>> and references therein. >>> >>> It has been a long time since I did anything with XML per se. We did a >>> certain amount of exposition in Chapter 8 >>> of the 2005 Springer monograph.? Since then more XPath support has come in >>> and many new ideas help distance users from >>> details of XML processing.? To illustrate a bit with your example, I trapped >>> the actual document reference >>> >>> zz = >>> xmlInternalTreeParse("http://www.ncbi.nih.gov/entrez/eutils/efetch .fcgi?tool=bioconductor&rettype=xml&retmode=text&db=Nucleotide&id=NM_0 00610") >>> >>> and then performed an XPath query >>> >>>> getNodeSet(zz, "//Seq-interval_from") >>> [[1]] >>> <seq-interval_from>3244</seq-interval_from> >>> >>> [[2]] >>> <seq-interval_from>3328</seq-interval_from> >>> >>> [[3]] >>> <seq-interval_from>5695</seq-interval_from> >>> >>> and so on.? I don't recall how to do a relatively simple task like >>> "enumerate all tags in use in a document" but it can be done with the XML >>> package tools.? I think it will be more effective to isolate the use case >>> and see how to use eutils to solve it fairly directly as opposed to wading >>> through XML, but perhaps wading is inevitable. >>> >>> >>> On Wed, Sep 21, 2011 at 12:29 PM, Andrew Yee <yee at="" post.harvard.edu=""> wrote: >>>> >>>> Hi, I'm looking for some guidance in terms of parsing the XML output >>>> from a genbank query. >>>> >>>> result <- genbank('NM_000610', disp='data', type='uid') >>>> >>>> I'm trying to figure out how to use the XML package in order to parse >>>> out the "sig_peptide" field from the XML output from the genbank >>>> query. >>>> >>>> Any pointers or suggestions would be appreciated, as I'm new to XML. >>>> >>>> Thanks, >>>> Andrew >>>> >>>> >>>> >>>> > sessionInfo() >>>> R version 2.13.0 (2011-04-13) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> ?[1] LC_CTYPE=en_US.UTF-8 ? ? ? LC_NUMERIC=C >>>> ?[3] LC_TIME=en_US.UTF-8 ? ? ? ?LC_COLLATE=en_US.UTF-8 >>>> ?[5] LC_MONETARY=C ? ? ? ? ? ? ?LC_MESSAGES=en_US.UTF-8 >>>> ?[7] LC_PAPER=en_US.UTF-8 ? ? ? LC_NAME=C >>>> ?[9] LC_ADDRESS=C ? ? ? ? ? ? ? LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base >>>> >>>> other attached packages: >>>> [1] XML_3.2-0 ? ? ? ? ? ? annotate_1.29.4 ? ? ? AnnotationDbi_1.13.21 >>>> [4] Biobase_2.11.10 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] DBI_0.2-5 ? ? RSQLite_0.9-4 tools_2.13.0 ?xtable_1.5-6 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 14.4 years ago Sean Davis 21k

0

Entering edit mode

Hi Andrew, > Thinking about this more broadly, is there a Bioconductor package > that lets you parse out the different features listed in a > GenBank feature, somewhat akin to this: > http://www.bioperl.org/wiki/HOWTO:Feature-Annotation You may want to take a look into the genoPlotR package from CRAN. It allows you to read GenBank files directly and to extract your features of interest from the resulting data.frame. Hth Gerhard ---------------------------------------------------------------------- -- Dr. Gerhard Thallinger E-mail: Gerhard.Thallinger at tugraz.at Institute for Genomics and Bioinformatics Web: http://genome.tugraz.at Graz University of Technology Tel: +43 316 873 5343 Petersgasse 14/V Fax: +43 316 873 105343 8010 Graz, Austria Map: http://genome.tugraz.at/Loc.html

ADD REPLY • link 14.4 years ago Gerhard Thallinger ▴ 180

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 19 days ago

United States

On 09/21/2011 02:55 PM, Andrew Yee wrote: > Thinking about this more broadly, is there a Bioconductor package that > lets you parse out the different features listed in a GenBank feature, > somewhat akin to this: > > http://www.bioperl.org/wiki/HOWTO:Feature-Annotation A helpful answer on Biostar by neilfws http://biostar.stackexchange.com/questions/12405/extracting-xml- output-from-genbank-query suggests using eutils with query library(XML) ## formulate an e-utils query baseurl <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi" url <- sprintf("%s?db=%s&id=%s&rettype=gb&retmode=xml", baseurl, "nucleotide", "NM_000610") I'd process this as ## retrieve the document xml <- xmlInternalTreeParse(url) ## extract GBFeature nodes ('//GBFeature') restricted ('[...]') to ## those with sub-node GBFeature_key with text value ('text()') equal ## to 'sig_peptide' -- there is just one of these query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ]" node <- getNodeSet(xml, query)[[1]] ## query the extracted node, starting at the current location ('./') query <- "./GBFeature_location/text()" xpathApply(node, query, xmlValue) ## do string coercion in XML query <- "string(.//GBInterval_accession/text())" getNodeSet(node, query) ## one-liner -- associated gene name query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ] //GBQualifier[ GBQualifier_name/text()='gene' ] /GBQualifier_value/text()" xpathApply(xml, query, xmlValue) The key resources for this are the XPath documentation http://www.w3.org/TR/xpath/ especially sections 2.5 (Abbreviated Syntax; the place to start) and 4 (Core Function Library). Also while exploring the document structure visually is probably a good way to start, for the hard-core the DTD is where the structure is defined, http://www.ncbi.nlm.nih.gov/data_specs/dtd/ e.g., http://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_GBSeq.mod.dtd Martin > > Thanks, > Andrew > > On Wed, Sep 21, 2011 at 5:41 PM, Andrew Yee<yee at="" post.harvard.edu=""> wrote: >> Thanks for the reply. I guess on a broader level, is there a way to >> extract the sig_peptide field from >> >> http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3 >> >> I'm trying to figure out why the document reference in Carey's example >> doesn't contain "sig_peptide" yet is visible on that web page. >> >> Perhaps there is another method of getting the annotation for >> sig_peptide within GenBank? >> >> Thanks, >> Andrew >> >> On Wed, Sep 21, 2011 at 4:07 PM, Vincent Carey >> <stvjc at="" channing.harvard.edu=""> wrote: >>> I don't see a sig_peptide field. You should have a look at >>> >>> http://www.omegahat.org/RSXML/shortIntro.html >>> >>> and references therein. >>> >>> It has been a long time since I did anything with XML per se. We did a >>> certain amount of exposition in Chapter 8 >>> of the 2005 Springer monograph. Since then more XPath support has come in >>> and many new ideas help distance users from >>> details of XML processing. To illustrate a bit with your example, I trapped >>> the actual document reference >>> >>> zz = >>> xmlInternalTreeParse("http://www.ncbi.nih.gov/entrez/eutils/efetch .fcgi?tool=bioconductor&rettype=xml&retmode=text&db=Nucleotide&id=NM_0 00610") >>> >>> and then performed an XPath query >>> >>>> getNodeSet(zz, "//Seq-interval_from") >>> [[1]] >>> <seq-interval_from>3244</seq-interval_from> >>> >>> [[2]] >>> <seq-interval_from>3328</seq-interval_from> >>> >>> [[3]] >>> <seq-interval_from>5695</seq-interval_from> >>> >>> and so on. I don't recall how to do a relatively simple task like >>> "enumerate all tags in use in a document" but it can be done with the XML >>> package tools. I think it will be more effective to isolate the use case >>> and see how to use eutils to solve it fairly directly as opposed to wading >>> through XML, but perhaps wading is inevitable. >>> >>> >>> On Wed, Sep 21, 2011 at 12:29 PM, Andrew Yee<yee at="" post.harvard.edu=""> wrote: >>>> >>>> Hi, I'm looking for some guidance in terms of parsing the XML output >>>> from a genbank query. >>>> >>>> result<- genbank('NM_000610', disp='data', type='uid') >>>> >>>> I'm trying to figure out how to use the XML package in order to parse >>>> out the "sig_peptide" field from the XML output from the genbank >>>> query. >>>> >>>> Any pointers or suggestions would be appreciated, as I'm new to XML. >>>> >>>> Thanks, >>>> Andrew >>>> >>>> >>>> >>>>> sessionInfo() >>>> R version 2.13.0 (2011-04-13) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 >>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> other attached packages: >>>> [1] XML_3.2-0 annotate_1.29.4 AnnotationDbi_1.13.21 >>>> [4] Biobase_2.11.10 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] DBI_0.2-5 RSQLite_0.9-4 tools_2.13.0 xtable_1.5-6 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793

ADD COMMENT • link 14.4 years ago Martin Morgan 25k

0

Entering edit mode

Thanks for posting this...I was going to post it as well. Question: is there a way for the genbank() function to be updated to pull this additional information down by default? In the Biostar post by neilfws, he points out that the sig_peptide information isn't in the XML data retrieved by genbank() . Thanks, Andrew On Thu, Sep 22, 2011 at 12:18 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: > On 09/21/2011 02:55 PM, Andrew Yee wrote: >> >> Thinking about this more broadly, is there a Bioconductor package that >> lets you parse out the different features listed in a GenBank feature, >> somewhat akin to this: >> >> http://www.bioperl.org/wiki/HOWTO:Feature-Annotation > > A helpful answer on Biostar by neilfws > > http://biostar.stackexchange.com/questions/12405/extracting-xml- output-from-genbank-query > > suggests using eutils with query > > library(XML) > > ## formulate an e-utils query > baseurl <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi" > url <- sprintf("%s?db=%s&id=%s&rettype=gb&retmode=xml", > ? ? ? ? ? ? ? baseurl, "nucleotide", "NM_000610") > > I'd process this as > > ## retrieve the document > xml <- xmlInternalTreeParse(url) > > ## extract GBFeature nodes ('//GBFeature') restricted ('[...]') to > ## those with sub-node GBFeature_key with text value ('text()') equal > ## to 'sig_peptide' -- there is just one of these > query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ]" > node <- getNodeSet(xml, query)[[1]] > > ## query the extracted node, starting at the current location ('./') > query <- "./GBFeature_location/text()" > xpathApply(node, query, xmlValue) > > ## do string coercion in XML > query <- "string(.//GBInterval_accession/text())" > getNodeSet(node, query) > > ## one-liner -- associated gene name > query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ] > ? ? ? ? ?//GBQualifier[ GBQualifier_name/text()='gene' ] > ? ? ? ? ?/GBQualifier_value/text()" > xpathApply(xml, query, xmlValue) > > The key resources for this are the XPath documentation > > http://www.w3.org/TR/xpath/ > > especially sections 2.5 (Abbreviated Syntax; the place to start) and 4 (Core > Function Library). Also while exploring the document structure visually is > probably a good way to start, for the hard-core the ?DTD is where the > structure is defined, > > http://www.ncbi.nlm.nih.gov/data_specs/dtd/ > > e.g., > > http://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_GBSeq.mod.dtd > > Martin >> >> Thanks, >> Andrew >> >> On Wed, Sep 21, 2011 at 5:41 PM, Andrew Yee<yee at="" post.harvard.edu=""> ?wrote: >>> >>> Thanks for the reply. ?I guess on a broader level, is there a way to >>> extract the sig_peptide field from >>> >>> http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3 >>> >>> I'm trying to figure out why the document reference in Carey's example >>> doesn't contain "sig_peptide" yet is visible on that web page. >>> >>> Perhaps there is another method of getting the annotation for >>> sig_peptide within GenBank? >>> >>> Thanks, >>> Andrew >>> >>> On Wed, Sep 21, 2011 at 4:07 PM, Vincent Carey >>> <stvjc at="" channing.harvard.edu=""> ?wrote: >>>> >>>> I don't see a sig_peptide field. ?You should have a look at >>>> >>>> http://www.omegahat.org/RSXML/shortIntro.html >>>> >>>> and references therein. >>>> >>>> It has been a long time since I did anything with XML per se. We did a >>>> certain amount of exposition in Chapter 8 >>>> of the 2005 Springer monograph. ?Since then more XPath support has come >>>> in >>>> and many new ideas help distance users from >>>> details of XML processing. ?To illustrate a bit with your example, I >>>> trapped >>>> the actual document reference >>>> >>>> zz = >>>> >>>> xmlInternalTreeParse("http://www.ncbi.nih.gov/entrez/eutils/efetc h.fcgi?tool=bioconductor&rettype=xml&retmode=text&db=Nucleotide&id=NM_ 000610") >>>> >>>> and then performed an XPath query >>>> >>>>> getNodeSet(zz, "//Seq-interval_from") >>>> >>>> [[1]] >>>> <seq-interval_from>3244</seq-interval_from> >>>> >>>> [[2]] >>>> <seq-interval_from>3328</seq-interval_from> >>>> >>>> [[3]] >>>> <seq-interval_from>5695</seq-interval_from> >>>> >>>> and so on. ?I don't recall how to do a relatively simple task like >>>> "enumerate all tags in use in a document" but it can be done with the >>>> XML >>>> package tools. ?I think it will be more effective to isolate the use >>>> case >>>> and see how to use eutils to solve it fairly directly as opposed to >>>> wading >>>> through XML, but perhaps wading is inevitable. >>>> >>>> >>>> On Wed, Sep 21, 2011 at 12:29 PM, Andrew Yee<yee at="" post.harvard.edu=""> >>>> ?wrote: >>>>> >>>>> Hi, I'm looking for some guidance in terms of parsing the XML output >>>>> from a genbank query. >>>>> >>>>> result<- genbank('NM_000610', disp='data', type='uid') >>>>> >>>>> I'm trying to figure out how to use the XML package in order to parse >>>>> out the "sig_peptide" field from the XML output from the genbank >>>>> query. >>>>> >>>>> Any pointers or suggestions would be appreciated, as I'm new to XML. >>>>> >>>>> Thanks, >>>>> Andrew >>>>> >>>>> >>>>> >>>>>> sessionInfo() >>>>> >>>>> R version 2.13.0 (2011-04-13) >>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>> >>>>> locale: >>>>> ?[1] LC_CTYPE=en_US.UTF-8 ? ? ? LC_NUMERIC=C >>>>> ?[3] LC_TIME=en_US.UTF-8 ? ? ? ?LC_COLLATE=en_US.UTF-8 >>>>> ?[5] LC_MONETARY=C ? ? ? ? ? ? ?LC_MESSAGES=en_US.UTF-8 >>>>> ?[7] LC_PAPER=en_US.UTF-8 ? ? ? LC_NAME=C >>>>> ?[9] LC_ADDRESS=C ? ? ? ? ? ? ? LC_TELEPHONE=C >>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>> >>>>> attached base packages: >>>>> [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base >>>>> >>>>> other attached packages: >>>>> [1] XML_3.2-0 ? ? ? ? ? ? annotate_1.29.4 ? ? ? AnnotationDbi_1.13.21 >>>>> [4] Biobase_2.11.10 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] DBI_0.2-5 ? ? RSQLite_0.9-4 tools_2.13.0 ?xtable_1.5-6 >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 >

ADD REPLY • link 14.4 years ago Andrew Yee ▴ 350

0

Entering edit mode

On Thu, Sep 22, 2011 at 12:34 PM, Andrew Yee <yee@post.harvard.edu> wrote: > Thanks for posting this...I was going to post it as well. > > Question: is there a way for the genbank() function to be updated to > pull this additional information down by default? In the Biostar post > by neilfws, he points out that the sig_peptide information isn't in > the XML data retrieved by genbank() . > > The objective needs to be clarified. genbank() does a reasonably well-documented task, and it returns _all_ the information (when disp="data"), it just doesn't model the response except as an XMLDocument instance. The retmode setting affects what comes back -- i used "text" and didn't see any instance of sig_peptide, but if you use "xml" you see it. There are a few things to consider for extending the annotation software that interacts with genbank. What aspects of Martin's XPath programming should be abstracted into reusable modules? This depends on the use cases of interest, and I haven't seen anything specific. Related to this: How do we want to structure the objects that are returned by genbank-related queries? We can get a terminology and data structure framework from the DTDs, perhaps, and/or we can try to use SOFA as mentioned by Sean, to get vocabulary and, perhaps, a data model sketch. We would probably want to use Biostrings and GenomicRanges resources to model aspects of the results. The infrastructure is all there. > Thanks, > Andrew > > On Thu, Sep 22, 2011 at 12:18 PM, Martin Morgan <mtmorgan@fhcrc.org> > wrote: > > On 09/21/2011 02:55 PM, Andrew Yee wrote: > >> > >> Thinking about this more broadly, is there a Bioconductor package that > >> lets you parse out the different features listed in a GenBank feature, > >> somewhat akin to this: > >> > >> http://www.bioperl.org/wiki/HOWTO:Feature-Annotation > > > > A helpful answer on Biostar by neilfws > > > > > http://biostar.stackexchange.com/questions/12405/extracting-xml- output-from-genbank-query > > > > suggests using eutils with query > > > > library(XML) > > > > ## formulate an e-utils query > > baseurl <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi" > > url <- sprintf("%s?db=%s&id=%s&rettype=gb&retmode=xml", > > baseurl, "nucleotide", "NM_000610") > > > > I'd process this as > > > > ## retrieve the document > > xml <- xmlInternalTreeParse(url) > > > > ## extract GBFeature nodes ('//GBFeature') restricted ('[...]') to > > ## those with sub-node GBFeature_key with text value ('text()') equal > > ## to 'sig_peptide' -- there is just one of these > > query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ]" > > node <- getNodeSet(xml, query)[[1]] > > > > ## query the extracted node, starting at the current location ('./') > > query <- "./GBFeature_location/text()" > > xpathApply(node, query, xmlValue) > > > > ## do string coercion in XML > > query <- "string(.//GBInterval_accession/text())" > > getNodeSet(node, query) > > > > ## one-liner -- associated gene name > > query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ] > > //GBQualifier[ GBQualifier_name/text()='gene' ] > > /GBQualifier_value/text()" > > xpathApply(xml, query, xmlValue) > > > > The key resources for this are the XPath documentation > > > > http://www.w3.org/TR/xpath/ > > > > especially sections 2.5 (Abbreviated Syntax; the place to start) and 4 > (Core > > Function Library). Also while exploring the document structure visually > is > > probably a good way to start, for the hard-core the DTD is where the > > structure is defined, > > > > http://www.ncbi.nlm.nih.gov/data_specs/dtd/ > > > > e.g., > > > > http://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_GBSeq.mod.dtd > > > > Martin > >> > >> Thanks, > >> Andrew > >> > >> On Wed, Sep 21, 2011 at 5:41 PM, Andrew Yee<yee@post.harvard.edu> > wrote: > >>> > >>> Thanks for the reply. I guess on a broader level, is there a way to > >>> extract the sig_peptide field from > >>> > >>> http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3 > >>> > >>> I'm trying to figure out why the document reference in Carey's example > >>> doesn't contain "sig_peptide" yet is visible on that web page. > >>> > >>> Perhaps there is another method of getting the annotation for > >>> sig_peptide within GenBank? > >>> > >>> Thanks, > >>> Andrew > >>> > >>> On Wed, Sep 21, 2011 at 4:07 PM, Vincent Carey > >>> <stvjc@channing.harvard.edu> wrote: > >>>> > >>>> I don't see a sig_peptide field. You should have a look at > >>>> > >>>> http://www.omegahat.org/RSXML/shortIntro.html > >>>> > >>>> and references therein. > >>>> > >>>> It has been a long time since I did anything with XML per se. We did a > >>>> certain amount of exposition in Chapter 8 > >>>> of the 2005 Springer monograph. Since then more XPath support has > come > >>>> in > >>>> and many new ideas help distance users from > >>>> details of XML processing. To illustrate a bit with your example, I > >>>> trapped > >>>> the actual document reference > >>>> > >>>> zz = > >>>> > >>>> xmlInternalTreeParse(" > http://www.ncbi.nih.gov/entrez/eutils/efetch.fcgi?tool=bioconductor& rettype=xml&retmode=text&db=Nucleotide&id=NM_000610 > ") > >>>> > >>>> and then performed an XPath query > >>>> > >>>>> getNodeSet(zz, "//Seq-interval_from") > >>>> > >>>> [[1]] > >>>> <seq-interval_from>3244</seq-interval_from> > >>>> > >>>> [[2]] > >>>> <seq-interval_from>3328</seq-interval_from> > >>>> > >>>> [[3]] > >>>> <seq-interval_from>5695</seq-interval_from> > >>>> > >>>> and so on. I don't recall how to do a relatively simple task like > >>>> "enumerate all tags in use in a document" but it can be done with the > >>>> XML > >>>> package tools. I think it will be more effective to isolate the use > >>>> case > >>>> and see how to use eutils to solve it fairly directly as opposed to > >>>> wading > >>>> through XML, but perhaps wading is inevitable. > >>>> > >>>> > >>>> On Wed, Sep 21, 2011 at 12:29 PM, Andrew Yee<yee@post.harvard.edu> > >>>> wrote: > >>>>> > >>>>> Hi, I'm looking for some guidance in terms of parsing the XML output > >>>>> from a genbank query. > >>>>> > >>>>> result<- genbank('NM_000610', disp='data', type='uid') > >>>>> > >>>>> I'm trying to figure out how to use the XML package in order to parse > >>>>> out the "sig_peptide" field from the XML output from the genbank > >>>>> query. > >>>>> > >>>>> Any pointers or suggestions would be appreciated, as I'm new to XML. > >>>>> > >>>>> Thanks, > >>>>> Andrew > >>>>> > >>>>> > >>>>> > >>>>>> sessionInfo() > >>>>> > >>>>> R version 2.13.0 (2011-04-13) > >>>>> Platform: x86_64-unknown-linux-gnu (64-bit) > >>>>> > >>>>> locale: > >>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > >>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > >>>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > >>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > >>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C > >>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > >>>>> > >>>>> attached base packages: > >>>>> [1] stats graphics grDevices utils datasets methods base > >>>>> > >>>>> other attached packages: > >>>>> [1] XML_3.2-0 annotate_1.29.4 AnnotationDbi_1.13.21 > >>>>> [4] Biobase_2.11.10 > >>>>> > >>>>> loaded via a namespace (and not attached): > >>>>> [1] DBI_0.2-5 RSQLite_0.9-4 tools_2.13.0 xtable_1.5-6 > >>>>> > >>>>> _______________________________________________ > >>>>> Bioconductor mailing list > >>>>> Bioconductor@r-project.org > >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>> Search the archives: > >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>> > >>>> > >>> > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > -- > > Computational Biology > > Fred Hutchinson Cancer Research Center > > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > > > Location: M1-B861 > > Telephone: 206 667-2793 > > > [[alternative HTML version deleted]]

ADD REPLY • link 14.4 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

> The objective needs to be clarified. ?genbank() does a reasonably> well-documented task, and it returns _all_> the information (when disp="data"), it just doesn't model the response> except as an XMLDocument instance. ?The> retmode setting affects what comes back -- i used "text" and didn't see any> instance of sig_peptide, but if you use> "xml" you see it. If I could push a little bit further, I was trying to figure out in the genbank() function where you can specify "text" v. "xml" and the retmode settings. Or is this parameter set elsewhere? Thanks, Andrew On Thu, Sep 22, 2011 at 2:46 PM, Vincent Carey <stvjc at="" channing.harvard.edu=""> wrote: > > > On Thu, Sep 22, 2011 at 12:34 PM, Andrew Yee <yee at="" post.harvard.edu=""> wrote: >> >> Thanks for posting this...I was going to post it as well. >> >> Question: ?is there a way for the genbank() function to be updated to >> pull this additional information down by default? ?In the Biostar post >> by neilfws, he points out that the sig_peptide information isn't in >> the XML data retrieved by genbank() . >> > > The objective needs to be clarified.? genbank() does a reasonably > well-documented task, and it returns _all_ > the information (when disp="data"), it just doesn't model the response > except as an XMLDocument instance.? The > retmode setting affects what comes back -- i used "text" and didn't see any > instance of sig_peptide, but if you use > "xml" you see it. > > There are a few things to consider for extending the annotation software > that interacts with genbank.? What aspects of Martin's XPath programming > should be abstracted into reusable? modules?? This depends on the use cases > of interest, and I haven't seen anything specific.? Related to this: How do > we want to structure the objects that are returned by genbank- related > queries?? We can get a terminology and data structure framework from the > DTDs, perhaps, and/or we can try to use SOFA as mentioned by Sean, to get > vocabulary and, perhaps, a data model sketch.? We would probably want to use > Biostrings and GenomicRanges resources to model aspects of the results.? The > infrastructure is all there. > > > >> >> Thanks, >> Andrew >> >> On Thu, Sep 22, 2011 at 12:18 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> >> wrote: >> > On 09/21/2011 02:55 PM, Andrew Yee wrote: >> >> >> >> Thinking about this more broadly, is there a Bioconductor package that >> >> lets you parse out the different features listed in a GenBank feature, >> >> somewhat akin to this: >> >> >> >> http://www.bioperl.org/wiki/HOWTO:Feature-Annotation >> > >> > A helpful answer on Biostar by neilfws >> > >> > >> > http://biostar.stackexchange.com/questions/12405/extracting-xml- output-from-genbank-query >> > >> > suggests using eutils with query >> > >> > library(XML) >> > >> > ## formulate an e-utils query >> > baseurl <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi" >> > url <- sprintf("%s?db=%s&id=%s&rettype=gb&retmode=xml", >> > ? ? ? ? ? ? ? baseurl, "nucleotide", "NM_000610") >> > >> > I'd process this as >> > >> > ## retrieve the document >> > xml <- xmlInternalTreeParse(url) >> > >> > ## extract GBFeature nodes ('//GBFeature') restricted ('[...]') to >> > ## those with sub-node GBFeature_key with text value ('text()') equal >> > ## to 'sig_peptide' -- there is just one of these >> > query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ]" >> > node <- getNodeSet(xml, query)[[1]] >> > >> > ## query the extracted node, starting at the current location ('./') >> > query <- "./GBFeature_location/text()" >> > xpathApply(node, query, xmlValue) >> > >> > ## do string coercion in XML >> > query <- "string(.//GBInterval_accession/text())" >> > getNodeSet(node, query) >> > >> > ## one-liner -- associated gene name >> > query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ] >> > ? ? ? ? ?//GBQualifier[ GBQualifier_name/text()='gene' ] >> > ? ? ? ? ?/GBQualifier_value/text()" >> > xpathApply(xml, query, xmlValue) >> > >> > The key resources for this are the XPath documentation >> > >> > http://www.w3.org/TR/xpath/ >> > >> > especially sections 2.5 (Abbreviated Syntax; the place to start) and 4 >> > (Core >> > Function Library). Also while exploring the document structure visually >> > is >> > probably a good way to start, for the hard-core the ?DTD is where the >> > structure is defined, >> > >> > http://www.ncbi.nlm.nih.gov/data_specs/dtd/ >> > >> > e.g., >> > >> > http://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_GBSeq.mod.dtd >> > >> > Martin >> >> >> >> Thanks, >> >> Andrew >> >> >> >> On Wed, Sep 21, 2011 at 5:41 PM, Andrew Yee<yee at="" post.harvard.edu=""> >> >> ?wrote: >> >>> >> >>> Thanks for the reply. ?I guess on a broader level, is there a way to >> >>> extract the sig_peptide field from >> >>> >> >>> http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3 >> >>> >> >>> I'm trying to figure out why the document reference in Carey's example >> >>> doesn't contain "sig_peptide" yet is visible on that web page. >> >>> >> >>> Perhaps there is another method of getting the annotation for >> >>> sig_peptide within GenBank? >> >>> >> >>> Thanks, >> >>> Andrew >> >>> >> >>> On Wed, Sep 21, 2011 at 4:07 PM, Vincent Carey >> >>> <stvjc at="" channing.harvard.edu=""> ?wrote: >> >>>> >> >>>> I don't see a sig_peptide field. ?You should have a look at >> >>>> >> >>>> http://www.omegahat.org/RSXML/shortIntro.html >> >>>> >> >>>> and references therein. >> >>>> >> >>>> It has been a long time since I did anything with XML per se. We did >> >>>> a >> >>>> certain amount of exposition in Chapter 8 >> >>>> of the 2005 Springer monograph. ?Since then more XPath support has >> >>>> come >> >>>> in >> >>>> and many new ideas help distance users from >> >>>> details of XML processing. ?To illustrate a bit with your example, I >> >>>> trapped >> >>>> the actual document reference >> >>>> >> >>>> zz = >> >>>> >> >>>> >> >>>> xmlInternalTreeParse("http://www.ncbi.nih.gov/entrez/eutils/ef etch.fcgi?tool=bioconductor&rettype=xml&retmode=text&db=Nucleotide&id= NM_000610") >> >>>> >> >>>> and then performed an XPath query >> >>>> >> >>>>> getNodeSet(zz, "//Seq-interval_from") >> >>>> >> >>>> [[1]] >> >>>> <seq-interval_from>3244</seq-interval_from> >> >>>> >> >>>> [[2]] >> >>>> <seq-interval_from>3328</seq-interval_from> >> >>>> >> >>>> [[3]] >> >>>> <seq-interval_from>5695</seq-interval_from> >> >>>> >> >>>> and so on. ?I don't recall how to do a relatively simple task like >> >>>> "enumerate all tags in use in a document" but it can be done with the >> >>>> XML >> >>>> package tools. ?I think it will be more effective to isolate the use >> >>>> case >> >>>> and see how to use eutils to solve it fairly directly as opposed to >> >>>> wading >> >>>> through XML, but perhaps wading is inevitable. >> >>>> >> >>>> >> >>>> On Wed, Sep 21, 2011 at 12:29 PM, Andrew Yee<yee at="" post.harvard.edu=""> >> >>>> ?wrote: >> >>>>> >> >>>>> Hi, I'm looking for some guidance in terms of parsing the XML output >> >>>>> from a genbank query. >> >>>>> >> >>>>> result<- genbank('NM_000610', disp='data', type='uid') >> >>>>> >> >>>>> I'm trying to figure out how to use the XML package in order to >> >>>>> parse >> >>>>> out the "sig_peptide" field from the XML output from the genbank >> >>>>> query. >> >>>>> >> >>>>> Any pointers or suggestions would be appreciated, as I'm new to XML. >> >>>>> >> >>>>> Thanks, >> >>>>> Andrew >> >>>>> >> >>>>> >> >>>>> >> >>>>>> sessionInfo() >> >>>>> >> >>>>> R version 2.13.0 (2011-04-13) >> >>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >> >>>>> >> >>>>> locale: >> >>>>> ?[1] LC_CTYPE=en_US.UTF-8 ? ? ? LC_NUMERIC=C >> >>>>> ?[3] LC_TIME=en_US.UTF-8 ? ? ? ?LC_COLLATE=en_US.UTF-8 >> >>>>> ?[5] LC_MONETARY=C ? ? ? ? ? ? ?LC_MESSAGES=en_US.UTF-8 >> >>>>> ?[7] LC_PAPER=en_US.UTF-8 ? ? ? LC_NAME=C >> >>>>> ?[9] LC_ADDRESS=C ? ? ? ? ? ? ? LC_TELEPHONE=C >> >>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >>>>> >> >>>>> attached base packages: >> >>>>> [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base >> >>>>> >> >>>>> other attached packages: >> >>>>> [1] XML_3.2-0 ? ? ? ? ? ? annotate_1.29.4 >> >>>>> AnnotationDbi_1.13.21 >> >>>>> [4] Biobase_2.11.10 >> >>>>> >> >>>>> loaded via a namespace (and not attached): >> >>>>> [1] DBI_0.2-5 ? ? RSQLite_0.9-4 tools_2.13.0 ?xtable_1.5-6 >> >>>>> >> >>>>> _______________________________________________ >> >>>>> Bioconductor mailing list >> >>>>> Bioconductor at r-project.org >> >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >>>>> Search the archives: >> >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >>>> >> >>>> >> >>> >> >> >> >> _______________________________________________ >> >> Bioconductor mailing list >> >> Bioconductor at r-project.org >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> Search the archives: >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> > >> > -- >> > Computational Biology >> > Fred Hutchinson Cancer Research Center >> > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >> > >> > Location: M1-B861 >> > Telephone: 206 667-2793 >> > > >

ADD REPLY • link 14.4 years ago Andrew Yee ▴ 350

0

Entering edit mode

note the genbank function source is available, and its pmaddress parameter uses by default the .efetch function -- you can write your own version of an address generator that does less or a different form of address coding. > .efetch function (db = "PubMed", disp = c("data", "browser"), type = c("uid", "accession")) { disp <- match.arg(disp) type <- match.arg(type) if (disp == "data") { base <- "entrez/eutils/efetch.fcgi?tool=bioconductor&rettype=xml&retmode=text& db=" } else { base1 <- "entrez/query.fcgi?tool=bioconductor&cmd=" if (type == "uid") { base2 <- "Retrieve&db=" } else { base2 <- "Search&db=" } base <- paste(base1, base2, sep = "") } return(paste(base, db, sep = "")) } <environment: namespace:annotate=""> On Mon, Sep 26, 2011 at 11:01 AM, Andrew Yee <yee@post.harvard.edu> wrote: > > The objective needs to be clarified. genbank() does a reasonably> > well-documented task, and it returns _all_> the information (when > disp="data"), it just doesn't model the response> except as an XMLDocument > instance. The> retmode setting affects what comes back -- i used "text" and > didn't see any> instance of sig_peptide, but if you use> "xml" you see it. > > If I could push a little bit further, I was trying to figure out in > the genbank() function where you can specify "text" v. "xml" and the > retmode settings. Or is this parameter set elsewhere? > > Thanks, > Andrew > > On Thu, Sep 22, 2011 at 2:46 PM, Vincent Carey > <stvjc@channing.harvard.edu> wrote: > > > > > > On Thu, Sep 22, 2011 at 12:34 PM, Andrew Yee <yee@post.harvard.edu> > wrote: > >> > >> Thanks for posting this...I was going to post it as well. > >> > >> Question: is there a way for the genbank() function to be updated to > >> pull this additional information down by default? In the Biostar post > >> by neilfws, he points out that the sig_peptide information isn't in > >> the XML data retrieved by genbank() . > >> > > > > The objective needs to be clarified. genbank() does a reasonably > > well-documented task, and it returns _all_ > > the information (when disp="data"), it just doesn't model the response > > except as an XMLDocument instance. The > > retmode setting affects what comes back -- i used "text" and didn't see > any > > instance of sig_peptide, but if you use > > "xml" you see it. > > > > There are a few things to consider for extending the annotation software > > that interacts with genbank. What aspects of Martin's XPath programming > > should be abstracted into reusable modules? This depends on the use > cases > > of interest, and I haven't seen anything specific. Related to this: How > do > > we want to structure the objects that are returned by genbank- related > > queries? We can get a terminology and data structure framework from the > > DTDs, perhaps, and/or we can try to use SOFA as mentioned by Sean, to get > > vocabulary and, perhaps, a data model sketch. We would probably want to > use > > Biostrings and GenomicRanges resources to model aspects of the results. > The > > infrastructure is all there. > > > > > > > >> > >> Thanks, > >> Andrew > >> > >> On Thu, Sep 22, 2011 at 12:18 PM, Martin Morgan <mtmorgan@fhcrc.org> > >> wrote: > >> > On 09/21/2011 02:55 PM, Andrew Yee wrote: > >> >> > >> >> Thinking about this more broadly, is there a Bioconductor package > that > >> >> lets you parse out the different features listed in a GenBank > feature, > >> >> somewhat akin to this: > >> >> > >> >> http://www.bioperl.org/wiki/HOWTO:Feature-Annotation > >> > > >> > A helpful answer on Biostar by neilfws > >> > > >> > > >> > > http://biostar.stackexchange.com/questions/12405/extracting-xml- output-from-genbank-query > >> > > >> > suggests using eutils with query > >> > > >> > library(XML) > >> > > >> > ## formulate an e-utils query > >> > baseurl <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi" > >> > url <- sprintf("%s?db=%s&id=%s&rettype=gb&retmode=xml", > >> > baseurl, "nucleotide", "NM_000610") > >> > > >> > I'd process this as > >> > > >> > ## retrieve the document > >> > xml <- xmlInternalTreeParse(url) > >> > > >> > ## extract GBFeature nodes ('//GBFeature') restricted ('[...]') to > >> > ## those with sub-node GBFeature_key with text value ('text()') equal > >> > ## to 'sig_peptide' -- there is just one of these > >> > query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ]" > >> > node <- getNodeSet(xml, query)[[1]] > >> > > >> > ## query the extracted node, starting at the current location ('./') > >> > query <- "./GBFeature_location/text()" > >> > xpathApply(node, query, xmlValue) > >> > > >> > ## do string coercion in XML > >> > query <- "string(.//GBInterval_accession/text())" > >> > getNodeSet(node, query) > >> > > >> > ## one-liner -- associated gene name > >> > query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ] > >> > //GBQualifier[ GBQualifier_name/text()='gene' ] > >> > /GBQualifier_value/text()" > >> > xpathApply(xml, query, xmlValue) > >> > > >> > The key resources for this are the XPath documentation > >> > > >> > http://www.w3.org/TR/xpath/ > >> > > >> > especially sections 2.5 (Abbreviated Syntax; the place to start) and 4 > >> > (Core > >> > Function Library). Also while exploring the document structure > visually > >> > is > >> > probably a good way to start, for the hard-core the DTD is where the > >> > structure is defined, > >> > > >> > http://www.ncbi.nlm.nih.gov/data_specs/dtd/ > >> > > >> > e.g., > >> > > >> > http://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_GBSeq.mod.dtd > >> > > >> > Martin > >> >> > >> >> Thanks, > >> >> Andrew > >> >> > >> >> On Wed, Sep 21, 2011 at 5:41 PM, Andrew Yee<yee@post.harvard.edu> > >> >> wrote: > >> >>> > >> >>> Thanks for the reply. I guess on a broader level, is there a way to > >> >>> extract the sig_peptide field from > >> >>> > >> >>> http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3 > >> >>> > >> >>> I'm trying to figure out why the document reference in Carey's > example > >> >>> doesn't contain "sig_peptide" yet is visible on that web page. > >> >>> > >> >>> Perhaps there is another method of getting the annotation for > >> >>> sig_peptide within GenBank? > >> >>> > >> >>> Thanks, > >> >>> Andrew > >> >>> > >> >>> On Wed, Sep 21, 2011 at 4:07 PM, Vincent Carey > >> >>> <stvjc@channing.harvard.edu> wrote: > >> >>>> > >> >>>> I don't see a sig_peptide field. You should have a look at > >> >>>> > >> >>>> http://www.omegahat.org/RSXML/shortIntro.html > >> >>>> > >> >>>> and references therein. > >> >>>> > >> >>>> It has been a long time since I did anything with XML per se. We > did > >> >>>> a > >> >>>> certain amount of exposition in Chapter 8 > >> >>>> of the 2005 Springer monograph. Since then more XPath support has > >> >>>> come > >> >>>> in > >> >>>> and many new ideas help distance users from > >> >>>> details of XML processing. To illustrate a bit with your example, > I > >> >>>> trapped > >> >>>> the actual document reference > >> >>>> > >> >>>> zz = > >> >>>> > >> >>>> > >> >>>> xmlInternalTreeParse(" > http://www.ncbi.nih.gov/entrez/eutils/efetch.fcgi?tool=bioconductor& rettype=xml&retmode=text&db=Nucleotide&id=NM_000610 > ") > >> >>>> > >> >>>> and then performed an XPath query > >> >>>> > >> >>>>> getNodeSet(zz, "//Seq-interval_from") > >> >>>> > >> >>>> [[1]] > >> >>>> <seq-interval_from>3244</seq-interval_from> > >> >>>> > >> >>>> [[2]] > >> >>>> <seq-interval_from>3328</seq-interval_from> > >> >>>> > >> >>>> [[3]] > >> >>>> <seq-interval_from>5695</seq-interval_from> > >> >>>> > >> >>>> and so on. I don't recall how to do a relatively simple task like > >> >>>> "enumerate all tags in use in a document" but it can be done with > the > >> >>>> XML > >> >>>> package tools. I think it will be more effective to isolate the > use > >> >>>> case > >> >>>> and see how to use eutils to solve it fairly directly as opposed to > >> >>>> wading > >> >>>> through XML, but perhaps wading is inevitable. > >> >>>> > >> >>>> > >> >>>> On Wed, Sep 21, 2011 at 12:29 PM, Andrew Yee<yee@post.harvard.edu> > >> >>>> wrote: > >> >>>>> > >> >>>>> Hi, I'm looking for some guidance in terms of parsing the XML > output > >> >>>>> from a genbank query. > >> >>>>> > >> >>>>> result<- genbank('NM_000610', disp='data', type='uid') > >> >>>>> > >> >>>>> I'm trying to figure out how to use the XML package in order to > >> >>>>> parse > >> >>>>> out the "sig_peptide" field from the XML output from the genbank > >> >>>>> query. > >> >>>>> > >> >>>>> Any pointers or suggestions would be appreciated, as I'm new to > XML. > >> >>>>> > >> >>>>> Thanks, > >> >>>>> Andrew > >> >>>>> > >> >>>>> > >> >>>>> > >> >>>>>> sessionInfo() > >> >>>>> > >> >>>>> R version 2.13.0 (2011-04-13) > >> >>>>> Platform: x86_64-unknown-linux-gnu (64-bit) > >> >>>>> > >> >>>>> locale: > >> >>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > >> >>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > >> >>>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > >> >>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > >> >>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C > >> >>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > >> >>>>> > >> >>>>> attached base packages: > >> >>>>> [1] stats graphics grDevices utils datasets methods > base > >> >>>>> > >> >>>>> other attached packages: > >> >>>>> [1] XML_3.2-0 annotate_1.29.4 > >> >>>>> AnnotationDbi_1.13.21 > >> >>>>> [4] Biobase_2.11.10 > >> >>>>> > >> >>>>> loaded via a namespace (and not attached): > >> >>>>> [1] DBI_0.2-5 RSQLite_0.9-4 tools_2.13.0 xtable_1.5-6 > >> >>>>> > >> >>>>> _______________________________________________ > >> >>>>> Bioconductor mailing list > >> >>>>> Bioconductor@r-project.org > >> >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> >>>>> Search the archives: > >> >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> >>>> > >> >>>> > >> >>> > >> >> > >> >> _______________________________________________ > >> >> Bioconductor mailing list > >> >> Bioconductor@r-project.org > >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> >> Search the archives: > >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > >> > > >> > -- > >> > Computational Biology > >> > Fred Hutchinson Cancer Research Center > >> > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > >> > > >> > Location: M1-B861 > >> > Telephone: 206 667-2793 > >> > > > > > > [[alternative HTML version deleted]]

ADD REPLY • link 14.4 years ago Vincent J. Carey, Jr. 6.7k

Login before adding your answer.