annotate() and genbank and XML
Andrew Yee ▴ 350
@andrew-yee-2667
Last seen 9.6 years ago
Hi,

I'm looking for some guidance in terms of parsing the XML output from a genbank query.

    result <- genbank('NM_000610', disp='data', type='uid')

I'm trying to figure out how to use the XML package in order to parse out the "sig_peptide" field from the XML output from the genbank query.

Any pointers or suggestions would be appreciated, as I'm new to XML.

Thanks,
Andrew

    > sessionInfo()
    R version 2.13.0 (2011-04-13)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] XML_3.2-0             annotate_1.29.4       AnnotationDbi_1.13.21
    [4] Biobase_2.11.10

    loaded via a namespace (and not attached):
    [1] DBI_0.2-5     RSQLite_0.9-4 tools_2.13.0  xtable_1.5-6
@vincent-j-carey-jr-4
Last seen 6 weeks ago
United States
I don't see a sig_peptide field. You should have a look at

http://www.omegahat.org/RSXML/shortIntro.html

and references therein.

It has been a long time since I did anything with XML per se. We did a certain amount of exposition in Chapter 8 of the 2005 Springer monograph. Since then more XPath support has come in, and many new ideas help distance users from the details of XML processing. To illustrate a bit with your example, I trapped the actual document reference

    zz = xmlInternalTreeParse("http://www.ncbi.nih.gov/entrez/eutils/efetch.fcgi?tool=bioconductor&rettype=xml&retmode=text&db=Nucleotide&id=NM_000610")

and then performed an XPath query

    > getNodeSet(zz, "//Seq-interval_from")
    [[1]]
    <Seq-interval_from>3244</Seq-interval_from>

    [[2]]
    <Seq-interval_from>3328</Seq-interval_from>

    [[3]]
    <Seq-interval_from>5695</Seq-interval_from>

and so on. I don't recall how to do a relatively simple task like "enumerate all tags in use in a document", but it can be done with the XML package tools. I think it will be more effective to isolate the use case and see how to use eutils to solve it fairly directly, as opposed to wading through XML, but perhaps wading is inevitable.
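The "enumerate all tags in use in a document" task mentioned above can probably be handled with a single XPath query. The following is only a minimal sketch: the inline XML is a made-up toy fragment (its nesting and the `Seq-interval_to` values are illustrative, not copied from the real efetch output), so that the example runs without network access. It relies on `xmlParse()`, `xpathSApply()`, `xmlName()` and `xmlValue()` from the XML package.

```r
library(XML)

# Toy stand-in for the efetch document; the structure is an assumption
# made for illustration, not the real NCBI record.
txt <- '<Seq-entry>
  <Seq-feat>
    <Seq-interval>
      <Seq-interval_from>3244</Seq-interval_from>
      <Seq-interval_to>3327</Seq-interval_to>
    </Seq-interval>
  </Seq-feat>
  <Seq-feat>
    <Seq-interval>
      <Seq-interval_from>3328</Seq-interval_from>
      <Seq-interval_to>5694</Seq-interval_to>
    </Seq-interval>
  </Seq-feat>
</Seq-entry>'

doc <- xmlParse(txt, asText = TRUE)

# "//*" selects every element node; xmlName() returns each node's tag name.
tags <- xpathSApply(doc, "//*", xmlName)

# Tabulate how often each tag occurs in the document.
tag_counts <- table(tags)
print(tag_counts)

# Pull the text of one tag of interest and convert it to integers.
from_values <- as.integer(xpathSApply(doc, "//Seq-interval_from", xmlValue))
print(from_values)
```

With a document parsed from the real URL, the same two `xpathSApply()` calls should apply unchanged to `zz`; scanning the resulting table is a quick way to check whether a tag such as sig_peptide appears at all.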
Thanks for the reply. I guess on a broader level, is there a way to extract the sig_peptide field from

http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3

I'm trying to figure out why the document reference in Carey's example doesn't contain "sig_peptide" even though it is visible on that web page. Perhaps there is another method of getting the annotation for sig_peptide within GenBank?

Thanks,
Andrew
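One possible explanation, offered tentatively: the web page displays the GenBank flat-file view, whereas the URL used earlier requests a different XML rendering of the record, so feature keys need not appear there as literal text. If efetch is instead asked for the GenBank-style XML (GBSeq) representation, each feature is expected to carry its key in a `GBFeature_key` element. The sketch below assumes that layout and uses a made-up toy fragment with invented coordinates rather than the real NM_000610 record.

```r
library(XML)

# Toy GBSeq-style fragment. Element names follow the GBSeq layout as an
# assumption; the locations are invented and are NOT the real annotation.
txt <- '<GBSeq>
  <GBSeq_feature-table>
    <GBFeature>
      <GBFeature_key>CDS</GBFeature_key>
      <GBFeature_location>10..90</GBFeature_location>
    </GBFeature>
    <GBFeature>
      <GBFeature_key>sig_peptide</GBFeature_key>
      <GBFeature_location>10..30</GBFeature_location>
    </GBFeature>
  </GBSeq_feature-table>
</GBSeq>'

doc <- xmlParse(txt, asText = TRUE)

# Select features whose key is "sig_peptide", then take their location text.
sig_locations <- xpathSApply(
  doc,
  "//GBFeature[GBFeature_key='sig_peptide']/GBFeature_location",
  xmlValue
)
print(sig_locations)
```

To try this against NCBI, the same XPath expression could be applied to a document fetched with efetch parameters along the lines of `db=nuccore`, `rettype=gb`, `retmode=xml`; those parameter values are an assumption here and should be checked against the current E-utilities documentation.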