I want to process the full texts of many articles using europepmc::epmc_ftxt (so that I can later use tidypmc::pmc_text and tidypmc::separate_text). The R code in (a) below is much too slow in this first retrieval step. mypmc is a vector of PMCIDs, e.g. c("PMC11102434", "PMC11127444", ...). I've used a loop with error handling so that one failed retrieval doesn't stop the whole run, but perhaps this could be faster?

Alternatively, perhaps FTP would be faster? At https://europepmc.org/ftp/oa/ there are many .gz files, which I could possibly access one at a time in a loop. Testing with one of these files, the code in (b) seems partially successful, but when I then call europepmc::epmc_ftxt(filez) it throws the error "Error in europepmc::epmc_ftxt(filez) : Not Found (HTTP 404). Failed to retrieve full text." An error of this kind was expected (it is bound to be unable to retrieve some full texts), but how can I stop it from preventing completion of the successful retrievals, assuming that this method can actually retrieve something successfully?

I've also tried manually unzipping the downloaded file using UnArchiver, but epmc_ftxt throws an error on the result as well. I've also tried gzfile(filez), but this creates a .gz.gz file. Any ideas?
## (a) mypmc is a vector of PMCID numbers e.g. c("PMC11102434", ...
library("europepmc")
docs <- list()
for (i in 1:1000) {
  # On failure (e.g. HTTP 404), store NA and carry on with the next PMCID
  docs[[i]] <- tryCatch(europepmc::epmc_ftxt(mypmc[[i]]), error = function(e) NA)
}
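For comparison, here is a minimal sketch of how the retrieval in (a) might be wrapped so that failures are recorded and filtered out afterwards. The names fetch_all, pause and stub_fetch are hypothetical and introduced only for illustration; the stub stands in for the network call so the example runs offline. This probably does not make the retrieval itself faster, since the time is most likely dominated by one HTTP request per article.

```r
# Sketch: fetch many full texts without letting one failure stop the run.
# `fetch` defaults to europepmc::epmc_ftxt; `fetch_all` and `pause` are
# hypothetical names used only in this example.
fetch_all <- function(ids, fetch = europepmc::epmc_ftxt, pause = 0) {
  out <- lapply(ids, function(id) {
    if (pause > 0) Sys.sleep(pause)   # optional delay between requests
    tryCatch(fetch(id), error = function(e) {
      message("Skipping ", id, ": ", conditionMessage(e))
      NULL                            # NULL marks a failed retrieval
    })
  })
  names(out) <- ids
  out
}

# Offline demonstration: a stub replaces the real network call.
stub_fetch <- function(id) {
  if (id == "PMC_MISSING") stop("Not Found (HTTP 404).")
  paste("full text of", id)
}

res <- fetch_all(c("PMC11102434", "PMC_MISSING", "PMC11127444"),
                 fetch = stub_fetch)
ok     <- !vapply(res, is.null, logical(1))
docs   <- res[ok]          # successful retrievals only
failed <- names(res)[!ok]  # IDs that could not be retrieved
```

With the real function, the call would be fetch_all(mypmc), and docs could then be passed on to tidypmc::pmc_text one element at a time.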
## (b)
filez <- "PMC13900_PMC17829.xml.gz"
url <- paste0("https://europepmc.org/ftp/oa/", filez)
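# download.file() returns a status code (0 = success), not the file contents;
# mode = "wb" keeps the binary .gz file intact on every platform
status <- download.file(url = url, destfile = filez, method = "auto", mode = "wb")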
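As far as I understand, epmc_ftxt() expects a PMCID and queries the Europe PMC web service, so passing it a local file name would be looked up as an ID, which may explain the 404. Below is a minimal sketch of reading a compressed XML file locally instead. The stand-in file and its contents are invented so the example runs offline. That the bulk files contain one article element per paper is an assumption, and I have not verified whether the resulting nodes can be passed to tidypmc::pmc_text.

```r
# Sketch: a .gz file can be read through a gzfile() connection without
# unzipping it first. A tiny stand-in file (invented contents) is written
# to a temporary directory so that the example runs offline.
path <- file.path(tempdir(), "example.xml.gz")
con <- gzfile(path, "w")
writeLines(c("<articles>",
             "<article><front>A</front></article>",
             "<article><front>B</front></article>",
             "</articles>"), con)
close(con)

# gzfile() only creates a connection; reading from it decompresses on the fly.
con <- gzfile(path, "r")
xml_lines <- readLines(con)
close(con)
n_articles <- sum(grepl("<article>", xml_lines, fixed = TRUE))

# If xml2 is installed, read_xml() can parse the compressed file directly.
# Assumption: each paper is an <article> element in the bulk file.
if (requireNamespace("xml2", quietly = TRUE)) {
  doc <- xml2::read_xml(path)
  articles <- xml2::xml_find_all(doc, "//article")
}
```

For the real download, path would be replaced by filez from (b).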
> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Berlin
tzcode source: internal