help with PubMed Central OAI

stubben ▴ 80

@stubben-4185

I've been using Efetch to get some full text articles from Pubmed Central, which works fine...

    url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC2784878"
    x <- readLines(url)
    doc <- xmlParse(x)   # requires XML package
    xpathSApply(doc, "//abstract", xmlValue)
    [1] "The majority of all genes have so far been identified and annotated systematically through in silico gene finding. Here we report the finding of 3662 strand-specific transcriptionally active regions (TARs) in the genome of Bacillus subtilis by the use of tiling arrays.

I recently noticed the PMC copyright says to use the FTP or OAI service for any "automated" retrievals, so I thought I would try OAI, but I can't get the same xpath queries to work.

    url <- "http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=GetRecord&metadataPrefix=pmc&identifier=oai:pubmedcentral.nih.gov:2784878"
    x2 <- readLines(url)   # will warn about incomplete final line
    doc2 <- xmlParse(x2)
    xpathSApply(doc2, "//abstract", xmlValue)
    list()

This query does work, so I know there's an abstract tag.

    table(xpathSApply(doc2, "//*", xmlName))

    abstract     ack            addr-line        aff                    article    article-categories
    1            1              1                1                      1          1
    article-id   article-meta   article-title    author-notes           back       body
    3            1              79               1                      1          1
    caption      contrib        contrib-group    copyright-statement    corresp    date
    7            3              1                1                      1          1

Thanks for any help.

Chris Stubben

Bacillus subtilis • 1.0k views

updated 12.0 years ago by Duncan Temple Lang ▴ 110 • written 12.0 years ago by stubben ▴ 80

Duncan Temple Lang ▴ 110

@duncan-temple-lang-1540

Hi Chris

The problem is that the <abstract> node has a namespace. So the following will do what you want (and also avoids using readLines() by retrieving the URL directly in xmlParse()).

    url <- "http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=GetRecord&metadataPrefix=pmc&identifier=oai:pubmedcentral.nih.gov:2784878"
    doc2 = xmlParse(url)
    getNodeSet(doc2, "//x:abstract", c("x" = "http://dtd.nlm.nih.gov/2.0/xsd/archivearticle"))

or

    xpathSApply(doc2, "//x:abstract", xmlValue, namespaces = c("x" = "http://dtd.nlm.nih.gov/2.0/xsd/archivearticle"))

The namespace is defined on the

node.

D.

On 4/20/12 10:33 AM, Chris Stubben wrote:
> I've been using Efetch to get some full text articles from Pubmed Central, which works fine...
> [...]
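The namespace behaviour discussed in this thread can be reproduced offline with a small inline document, which may help when testing XPath queries without hitting the OAI service. The sketch below is only an illustration: the XML string is a made-up, heavily trimmed stand-in for a GetRecord response, and placing the default namespace declaration on the `article` element is an assumption made for the example. Only the namespace URI and the `XML` package functions come from the posts above.

```r
library(XML)  # assumes the XML package is installed

# Hypothetical stand-in for an OAI GetRecord response. The element layout and
# the abstract text are invented; only the namespace URI is from the thread.
xml_text <- paste0(
  '<record xmlns="http://www.openarchives.org/OAI/2.0/">',
  '<metadata>',
  '<article xmlns="http://dtd.nlm.nih.gov/2.0/xsd/archivearticle">',
  '<front><article-meta>',
  '<abstract><p>Example abstract text.</p></abstract>',
  '</article-meta></front>',
  '</article>',
  '</metadata>',
  '</record>'
)
doc <- xmlParse(xml_text, asText = TRUE)

# xmlName() reports the local name, so a listing of all element names
# includes "abstract" even though the element lives in a namespace.
element_names <- xpathSApply(doc, "//*", xmlName)

# An unprefixed XPath name only matches elements in no namespace,
# so this query finds nothing.
unprefixed <- xpathSApply(doc, "//abstract", xmlValue)

# Binding a prefix (the choice of "x" is arbitrary) to the namespace URI
# lets the query match.
ns <- c(x = "http://dtd.nlm.nih.gov/2.0/xsd/archivearticle")
prefixed <- xpathSApply(doc, "//x:abstract", xmlValue, namespaces = ns)

# Alternative that ignores namespaces entirely by testing the local name.
by_local_name <- xpathSApply(doc, "//*[local-name() = 'abstract']", xmlValue)

print(element_names)
print(length(unprefixed))
print(prefixed)
print(by_local_name)
```

The `local-name()` form is convenient when the namespace URI is unknown or changes between schema versions, at the cost of also matching same-named elements from any other namespace.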

12.0 years ago • Duncan Temple Lang ▴ 110
