paper - download - pubmed
Hi,

I need to download PDFs through R code. What I want to do is search for papers in PubMed, which is possible with the GetPubMed function in the package NCBI2R:

GetPubMed(searchterm, file = "", download = TRUE, showurl = FALSE, xldiv = ";",
          hyper = "HYPERLINK", MaxRet = 30000, sme = FALSE, smt = FALSE,
          quiet = TRUE, batchsize = 500, descHead = FALSE)

With this function I cannot download the PDFs for the hits, even though the PDFs are available when I go to PubMed and search there directly. So the search itself is not the problem; the problem is that, for each paper in the results, I need to download the PDF (which is available if I search PubMed directly) and the corresponding supplementary files.

Could anybody please help me solve this?

Thanks,
Nooshin
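For reference, a minimal call built only from the GetPubMed signature quoted above might look like the sketch below; the search term and output file name are placeholders, and the comment just restates the behaviour described in the question.

library(NCBI2R)
# placeholder search term and file name; as described above, even with
# download = TRUE this does not fetch the article PDFs or supplementary files
hits <- GetPubMed("arabidopsis AND network", file = "hits.txt", download = TRUE)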
Hi Nooshin,

NCBI2R is a CRAN package, not BioC. You should either contact the maintainer directly, or ask on R-help.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
Hi Jim,

I know, but I posted to the Bioconductor list in case anybody here knows of a Bioc package that would help in my case.

Thanks,
Nooshin
Nooshin,

You can download PDFs from PubMed Central if you have one PMC id:

download.file("http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3446303/pdf",
              "PMC3446303.pdf")

However, NCBI clearly states that you may NOT use any kind of automated process to download articles in bulk from the main PMC site, so I would use the FTP site for Open Access articles (see http://www.ncbi.nlm.nih.gov/pmc/tools/ftp). The FTP site also has the supplemental files included. First, read the list of available files:

pmcftp <- read.delim("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.txt",
                     skip = 1, header = FALSE, stringsAsFactors = FALSE)
nrow(pmcftp)
[1] 552677
names(pmcftp) <- c("dir", "citation", "id")

Then match PMC ids and loop through the results to download and untar the files:

y <- subset(pmcftp, id %in% c("PMC3446303", "PMC3463124"))
y
509377 75/e9/Genome_Biol_2012_Apr_24_13(4)_R29.tar.gz            Genome Biol. 2012 Apr 24; 13(4):R29            PMC3446303
514389 04/0f/Bioinformatics_2012_Oct_1_28(19)_2532-2533.tar.gz   Bioinformatics. 2012 Oct 1; 28(19):2532-2533   PMC3463124

for (i in 1:nrow(y)) {
    destfile <- paste(y$id[i], ".tar.gz", sep = "")
    download.file(paste("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc", y$dir[i], sep = "/"),
                  destfile)
    untar(destfile, compressed = TRUE)
}

Also, if you need to get a list of PMC ids in R, I have a package called genomes on BioC that includes E-utility scripts. So something like this query would get the 49 PMC ids for articles with Bioconductor in the title:

x2 <- esummary(esearch("bioconductor[TITLE] AND open access[FILTER]", db = "pmc"),
               version = "2.0")

esummary uses a generic parser by default, so PMC ids are mashed together in a column with other ids:

ids <- gsub(".*(PMC[0-9]*)", "\\1", x2$ArticleIds)
y <- subset(pmcftp, id %in% ids)

You could run esummary and add parse=FALSE to get the XML results and parse that any way you like. Or even use esearch and set usehistory="n":

ids2 <- paste("PMC", esearch("bioconductor[TITLE] AND open access[FILTER]",
                             db = "pmc", usehistory = "n", retmax = 100), sep = "")

Chris

--
Chris Stubben
Los Alamos National Lab
Bioscience Division MS M888
Los Alamos, NM 87545
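To tie those steps together, here is a minimal sketch of a helper that downloads and unpacks the Open Access packages for a vector of PMC ids. The function name get_pmc_oa, the destination-directory argument, and the warning for ids missing from the file list are my own additions, not part of any package; the individual calls are the ones shown above.

# hedged sketch, assuming the PMC Open Access FTP layout described above
get_pmc_oa <- function(pmcids, destdir = ".") {
    # read the list of available Open Access packages (one row per article)
    pmcftp <- read.delim("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.txt",
                         skip = 1, header = FALSE, stringsAsFactors = FALSE)
    names(pmcftp) <- c("dir", "citation", "id")
    y <- subset(pmcftp, id %in% pmcids)
    missing <- setdiff(pmcids, y$id)
    if (length(missing) > 0)
        warning("not in the Open Access file list: ", paste(missing, collapse = ", "))
    for (i in seq_len(nrow(y))) {
        destfile <- file.path(destdir, paste(y$id[i], ".tar.gz", sep = ""))
        download.file(paste("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc", y$dir[i], sep = "/"),
                      destfile)
        # each .tar.gz contains the PDF, the article XML and any supplementary files
        untar(destfile, compressed = TRUE, exdir = destdir)
    }
    invisible(y)
}

# usage, with the two example ids from above:
# get_pmc_oa(c("PMC3446303", "PMC3463124"))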
Hi Chris,

Thank you very much for the clear explanation and the pointers. I will come back to you if I have more questions or run into problems with the code.

Many thanks!
Nooshin
Hi Chris,

What you described works perfectly for PMC ids, but not for PubMed ids. For example, if I need the PDF for the PubMed id 10417722, what should I do?

Through my institute I am allowed to download papers from various journals, but at the moment I can only get the papers that have PMC ids, not those with only PubMed ids. I downloaded the PMC-ids.csv file, and the PubMed id above is not even listed there.

Is there any way to download the PDFs and supplementary files that are not publicly available but are accessible through my institute's network, or to get them via the PubMed ids? I have all the PubMed ids for which I need the PDFs, and I need to automate this.

I would be thankful for any hints or help.

Thanks and cheers,
Nooshin
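As an aside, one way to check PubMed-to-PMC mappings in bulk is the PMC-ids.csv file mentioned above. The sketch below assumes the file has PMID and PMCID columns and sits at the usual location on the PMC FTP site; treat the URL and column names as assumptions rather than documented facts.

# hedged sketch: map PubMed ids to PMC ids via the PMC-ids.csv mapping file
# (assumed URL and column names; the file is fairly large, and an id absent
# from it simply has no PMC record in the mapping)
map <- read.csv("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv",
                stringsAsFactors = FALSE)
pmids <- c("14769935", "10417722")          # example ids from this thread
found <- map[map$PMID %in% pmids, c("PMID", "PMCID")]
found
setdiff(pmids, found$PMID)                  # ids with no PMC mapping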
There are a few ways to get PMC ids from PubMed ids using E-utilities and the genomes package.

# E-link - for a list of links see
subset(einfo("pubmed", links = TRUE), DbTo == "pmc")

# dbfrom = pubmed by default
elink(14769935, dbto = "pmc", cmd = "neighbor", linkname = "pubmed_pmc")
[1] 357076   # = PMC357076

# or, if no PMC id is available
elink(10417722, dbto = "pmc", cmd = "neighbor", linkname = "pubmed_pmc")
numeric(0)

# or use E-fetch and get the abstract - the PMCID is listed before the PMID
# and you could use grep to grab that. Again, pubmed is the default db
efetch(14769935, rettype = "abstract")
[26] "PMCID: PMC357076"
[27] "PMID: 14769935 [PubMed - indexed for MEDLINE]"

# or get XML from efetch
x <- efetch(14769935, retmode = "xml")
doc <- xmlParse(x)   # requires the XML package
xpathSApply(doc, '//ArticleId[@IdType="pmc"]', xmlValue)
[1] "PMC357076"

If the PubMed id is not linked to PMC, you could read the PubMed results page and check if there is a link to a full-text article from the publisher.

url <- "http://www.ncbi.nlm.nih.gov/pubmed/?term=10417722"
doc <- xmlParse(url)
## the results page includes a namespace, so queries look awful
xpathSApply(doc, '//x:div[@class="icons"]/x:div/x:a', xmlGetAttr, "href",
            namespaces = c("x" = "http://www.w3.org/1999/xhtml"))
[1] "http://onlinelibrary.wiley.com/resolve/openurl?genre=article&sid=nlm:pubmed&issn=0960-7412&date=1999&volume=19&issue=1&spage=9"

You could read that link and find another link to download the PDF, which is probably different for each publisher...

http://onlinelibrary.wiley.com/doi/10.1046/j.1365-313X.1999.00491.x/pdf

Chris
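Building on the elink calls above, a minimal sketch for mapping a whole vector of PubMed ids to PMC ids might look like the following. The wrapper itself (the sapply loop and the NA handling for unlinked ids) is my own addition; only the elink call and its behaviour for unlinked ids are taken from the reply above.

library(genomes)   # BioC package providing elink, as described above
pmids <- c(14769935, 10417722)        # example ids from this thread
# elink returns numeric(0) when a PubMed id has no linked PMC record,
# so keep NA in that case and prefix the rest with "PMC"
pmcids <- sapply(pmids, function(p) {
    res <- elink(p, dbto = "pmc", cmd = "neighbor", linkname = "pubmed_pmc")
    if (length(res) == 0) NA_character_ else paste("PMC", res, sep = "")
})
pmcids
# expected, based on the output above: "PMC357076" NA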