Deleted: Accessing PubMed Central full-texts via FTP?
Abiologist • 0

I wish to process the full texts of many articles using europepmc::epmc_ftxt (so that I can later use tidypmc::pmc_text and tidypmc::separate_text). The R code in (a) below is much too slow in this first retrieval step, before any processing. Here mypmc is a vector of PMCIDs, e.g. c("PMC11102434", "PMC11127444", ...). I've used a loop with error handling so that a failed retrieval does not stop the whole run, but perhaps this could be faster?

Perhaps FTP would also be faster? At https://europepmc.org/ftp/oa/ there are many .gz files, which I could possibly access one at a time in a loop. Testing with one of these files, the code in (b) seems partially successful, but when I then call europepmc::epmc_ftxt(filez) it throws the error "Error in europepmc::epmc_ftxt(filez) : Not Found (HTTP 404). Failed to retrieve full text." An error was expected (it is bound to fail on some full texts), but how can I stop it from preventing the successful retrievals from completing, assuming this method can actually retrieve anything?

I've also tried manually unzipping the downloaded file using UnArchiver, but epmc_ftxt still throws an error, and I've tried gzfile(filez), but this creates a .gz.gz file. Any ideas?

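## (a) mypmc is a character vector of PMCIDs, e.g. c("PMC11102434", "PMC11127444", ...)
library("europepmc")
n <- min(1000, length(mypmc))  # process at most the first 1000 IDs
docs <- vector("list", n)      # preallocate the result list
for (i in seq_len(n)) {
  # store NA instead of stopping when a full text cannot be retrieved
  docs[[i]] <- tryCatch(europepmc::epmc_ftxt(mypmc[[i]]), error = function(e) NA)
}
## (b)
filez <- "PMC13900_PMC17829.xml.gz"
url <- paste0("https://europepmc.org/ftp/oa/", filez)
# download.file() returns a status code (0 on success), not the file contents;
# mode = "wb" keeps the binary .gz file intact
status <- download.file(url = url, destfile = filez, method = "auto", mode = "wb")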
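
One way to stop a single failed retrieval from blocking the successful ones is to wrap the fetch in a helper that turns each error into NULL and drops the failures afterwards. This is only a minimal sketch: `fetch_all` and its `fetch` argument are hypothetical names introduced for illustration, and the default assumes that `europepmc::epmc_ftxt` is called with one PMCID string at a time, as in (a). It does not make the requests themselves any faster.

```r
# Hypothetical helper: try every ID and keep only the successful retrievals.
# `fetch` defaults to europepmc::epmc_ftxt, but any function of one ID works.
fetch_all <- function(ids, fetch = europepmc::epmc_ftxt) {
  results <- lapply(ids, function(id) {
    # Convert any error (e.g. HTTP 404) into NULL so the remaining IDs still run
    tryCatch(fetch(id), error = function(e) NULL)
  })
  names(results) <- ids
  # Drop the failures; what remains is a list named by PMCID
  Filter(Negate(is.null), results)
}
```

With this, `docs <- fetch_all(mypmc)` would return only the retrieved documents, and `setdiff(mypmc, names(docs))` would list the IDs that failed.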

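As far as I can tell, the 404 in (b) may arise because `europepmc::epmc_ftxt` expects a PMCID and sends a web request for it, so it cannot read a local file name. Likewise, `gzfile()` only opens a connection to a compressed file and does not unpack it on disk. The sketch below reads a downloaded .gz file in base R; `read_gz_text` is a hypothetical helper name, and whether the result can be passed on to tidypmc is an assumption I have not verified.

```r
# Hypothetical helper: read a gzip-compressed text/XML file into one string
# without unpacking it on disk.
read_gz_text <- function(path) {
  con <- gzfile(path, open = "rt")   # open a decompressing text connection
  on.exit(close(con))                # always close the connection on exit
  paste(readLines(con, warn = FALSE), collapse = "\n")
}
```

For example, `xml_text <- read_gz_text(filez)` would hold the decompressed XML as a single string, which may need a lot of memory for the larger bulk files. Alternatively, `xml2::read_xml(filez)` is documented to decompress local .gz paths automatically and returns a parsed document directly.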
> sessionInfo( )
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal