annotate() and genbank and XML

Andrew Yee ▴ 350

@andrew-yee-2667

Last seen 11.3 years ago

Hi,

I'm looking for some guidance in terms of parsing the XML output from a genbank query.

    result <- genbank('NM_000610', disp='data', type='uid')

I'm trying to figure out how to use the XML package in order to parse out the "sig_peptide" field from the XML output from the genbank query.

Any pointers or suggestions would be appreciated, as I'm new to XML.

Thanks,
Andrew

    > sessionInfo()
    R version 2.13.0 (2011-04-13)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] XML_3.2-0             annotate_1.29.4       AnnotationDbi_1.13.21
    [4] Biobase_2.11.10

    loaded via a namespace (and not attached):
    [1] DBI_0.2-5     RSQLite_0.9-4 tools_2.13.0  xtable_1.5-6

• 1.4k views

ADD COMMENT • link updated 14.2 years ago by Vincent J. Carey, Jr. 6.7k • written 14.2 years ago by Andrew Yee ▴ 350

Vincent J. Carey, Jr. 6.7k

@vincent-j-carey-jr-4

Last seen 12 weeks ago

United States

I don't see a sig_peptide field. You should have a look at

http://www.omegahat.org/RSXML/shortIntro.html

and references therein.

It has been a long time since I did anything with XML per se. We did a certain amount of exposition in Chapter 8 of the 2005 Springer monograph. Since then more XPath support has come in and many new ideas help distance users from details of XML processing. To illustrate a bit with your example, I trapped the actual document reference

    zz = xmlInternalTreeParse("http://www.ncbi.nih.gov/entrez/eutils/efetch.fcgi?tool=bioconductor&rettype=xml&retmode=text&db=Nucleotide&id=NM_000610")

and then performed an XPath query

    > getNodeSet(zz, "//Seq-interval_from")
    [[1]]
    <Seq-interval_from>3244</Seq-interval_from>

    [[2]]
    <Seq-interval_from>3328</Seq-interval_from>

    [[3]]
    <Seq-interval_from>5695</Seq-interval_from>

and so on. I don't recall how to do a relatively simple task like "enumerate all tags in use in a document" but it can be done with the XML package tools. I think it will be more effective to isolate the use case and see how to use eutils to solve it fairly directly as opposed to wading through XML, but perhaps wading is inevitable.

On Wed, Sep 21, 2011 at 12:29 PM, Andrew Yee <yee@post.harvard.edu> wrote:
> Hi, I'm looking for some guidance in terms of parsing the XML output
> from a genbank query.
>
> result <- genbank('NM_000610', disp='data', type='uid')
>
> I'm trying to figure out how to use the XML package in order to parse
> out the "sig_peptide" field from the XML output from the genbank
> query.
>
> Any pointers or suggestions would be appreciated, as I'm new to XML.
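For what it's worth, the "enumerate all tags in use" task mentioned above can probably be done with a single XPath query. The sketch below is only an illustration: it parses a small hand-written stand-in document (the `toy` root element and the variable names `txt`, `doc`, `tags`, and `starts` are made up here, and only the three start values come from the output shown above) rather than the live efetch result, so it runs without network access.

```r
library(XML)

# Hand-written stand-in for a fragment of the efetch XML.
# The <toy> wrapper is hypothetical; the three values are those printed above.
txt <- '<toy>
  <Seq-interval><Seq-interval_from>3244</Seq-interval_from></Seq-interval>
  <Seq-interval><Seq-interval_from>3328</Seq-interval_from></Seq-interval>
  <Seq-interval><Seq-interval_from>5695</Seq-interval_from></Seq-interval>
</toy>'

# Parse from a string instead of a URL
doc <- xmlParse(txt, asText = TRUE)

# "//*" matches every element; xmlName() returns each element's tag name
tags <- unique(xpathSApply(doc, "//*", xmlName))

# Pull the text content of the matched nodes instead of the nodes themselves
starts <- as.integer(xpathSApply(doc, "//Seq-interval_from", xmlValue))

print(tags)
print(starts)
```

Applied to `zz` in place of `doc`, the same `xpathSApply(zz, "//*", xmlName)` call should list the tags of the real document, which is one way to check whether anything resembling sig_peptide is present.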

ADD COMMENT • link 14.2 years ago Vincent J. Carey, Jr. 6.7k

Thanks for the reply. I guess on a broader level, is there a way to extract the sig_peptide field from

http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3

I'm trying to figure out why the document reference in Carey's example doesn't contain "sig_peptide" even though it is visible on that web page. Perhaps there is another method of getting the annotation for sig_peptide within GenBank?

Thanks,
Andrew

On Wed, Sep 21, 2011 at 4:07 PM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
> I don't see a sig_peptide field. [...]
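One avenue that may be worth trying, offered as a sketch and not something confirmed in this thread: the web page shows the GenBank flat-file view, and E-utilities can, as far as I know, return that same view as GBSeq-style XML (I believe via `rettype=gb` with `retmode=xml`, but treat those parameter values as an assumption to check against the efetch documentation). In that format each feature is a `GBFeature` element with a `GBFeature_key` child, so an XPath predicate can select the sig_peptide entry. The snippet below uses a hand-written stand-in document; the coordinates are placeholders, not the real values for NM_000610.

```r
library(XML)

# Hand-written GBSeq-style fragment. Element names follow the GBSeq format;
# the locations are PLACEHOLDERS, not real NM_000610 coordinates.
gb <- '<GBSet><GBSeq><GBSeq_feature-table>
  <GBFeature>
    <GBFeature_key>CDS</GBFeature_key>
    <GBFeature_location>10..90</GBFeature_location>
  </GBFeature>
  <GBFeature>
    <GBFeature_key>sig_peptide</GBFeature_key>
    <GBFeature_location>10..30</GBFeature_location>
  </GBFeature>
</GBSeq_feature-table></GBSeq></GBSet>'

doc <- xmlParse(gb, asText = TRUE)

# Keep only features whose key is "sig_peptide", then read their location
loc <- xpathSApply(
  doc,
  "//GBFeature[GBFeature_key='sig_peptide']/GBFeature_location",
  xmlValue
)

print(loc)
```

If the real record comes back in this shape, the same XPath expression should work unchanged on the parsed efetch result.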

ADD REPLY • link 14.2 years ago Andrew Yee ▴ 350
