Hi, I am trying to build an OrgDb package from the NCBI database. However, I get a URL access error when running AnnotationForge::makeOrgPackageFromNCBI.
My code is:
> makeOrgPackageFromNCBI(version = "0.1",
+ author = "Chang Liu <liuchangbio@163.com>",
+ maintainer = "Chang Liu <liuchangbio@163.com>",
+ outputDir = ".",
+ NCBIFilesDir = getwd(),
+ tax_id = "703339", # Staphylococcus aureus
+ genus = "Staphylococcus",
+ species = "aureus")
Here is the ERROR report:
If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
Error in .tryDL(url, tmp) : url access failed after
4
attempts; url:
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz
Hi James,
Thank you very much for your response. I have done as you suggested, but it still does not work and fails with the same error.
To test this, I ran the code from the official documentation, but it still cannot connect.
I am confused by this problem. My network is OK, and makeOrgPackageFromNCBI is the officially recommended method, so this error is difficult to understand.
Here is my code:
Thank you for your attention :)
AFAIK you should set
options(timeout = 5000)
before you call the function makeOrgPackageFromNCBI, and not use it as an argument within makeOrgPackageFromNCBI... On Windows the default maximum request time is 60 seconds, and you thus overrule this value.
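A minimal sketch of the suggestion above: the timeout is a session option, so it has to be set before the call rather than passed as an argument. `with_timeout` is a hypothetical helper, not part of AnnotationForge; it also puts the previous value back afterwards.

```r
# Hypothetical helper (not part of AnnotationForge): temporarily raise R's
# download timeout, evaluate an expression, then restore the old value.
with_timeout <- function(seconds, expr) {
  old <- options(timeout = seconds)   # options() returns the previous setting
  on.exit(options(old), add = TRUE)   # restore it even if expr fails
  force(expr)                         # expr is lazy, so it is evaluated only now
}

# Example: the raised timeout is in effect while the expression runs.
with_timeout(5000, getOption("timeout"))
```

The makeOrgPackageFromNCBI(...) call from the first post could be passed as the second argument in the same way.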
Hi,
Thank you for your reply.
I have tried to follow the advice you gave, but it still does not work. This does not look like an easy ERROR to fix.
I searched the Bioconductor forum and no one seems to have asked the same question.
The same question was asked on GitHub, but it wasn't solved there either.
Anyway, thank you for your advice. Thank you very much!
Does this work?
It looks like you might be blocked by a firewall.
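One way to test from within R whether something like a firewall is in the way (a sketch only, and not necessarily the check proposed above) is to try fetching the failing URL directly. `can_download` is a hypothetical helper built on base R's download.file.

```r
# Hypothetical helper: TRUE if R can fetch `url` into a temporary file,
# FALSE on any download warning or error (timeout, blocked port, bad URL).
can_download <- function(url, timeout = 60) {
  old <- options(timeout = timeout)
  on.exit(options(old), add = TRUE)
  dest <- tempfile()
  on.exit(unlink(dest), add = TRUE)
  tryCatch({
    status <- download.file(url, dest, mode = "wb", quiet = TRUE)
    status == 0 && file.exists(dest)   # download.file returns 0 on success
  }, warning = function(w) FALSE, error = function(e) FALSE)
}

# Example with the URL from the error message. Note that this downloads the
# whole file, so it needs a generous timeout:
# can_download("ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz", timeout = 5000)
```

If this returns FALSE in R while the same URL works in a browser, the problem is likely in how R reaches the network (proxy, firewall, FTP being blocked) rather than in AnnotationForge.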
Your answer captures the essence of the problem.
At first it did not work, indicating that there was a problem with my network. After improving my network conditions, this issue was resolved, but I have to admit that the download is very slow and unreliable.
And naturally, several subsequent steps were successfully performed, but eventually it stopped at a new problem.
This is my code:
What I get:
In case this is just an occasional network problem, I have tried many times on different networks, but I haven't found a solution yet. I'll keep trying until it's solved because I really want to use makeOrgPackageFromNCBI!
Thank you for answering my question!
You are having the same problem with the data from expasy.org. You only get a fraction of the file before you hit the timeout, which as you can see is only 1000 seconds. You need to bump that up by (probably) a factor of 10.
Hi,
I am now very clear that my problem is a network problem.
What bothers me is that I was actually able to download all the files individually using my browser, including gene2pubmed.gz, gene2accession.gz, gene2refseq.gz, gene_info.gz, gene2go.gz and idmapping_selected.tab.gz.
What's preventing me from succeeding is simply that I don't have the ability to download them automatically within RStudio.
So, I'd like to ask, since I can download all files by myself, is there an option that tells the program to just use the files in specific folder?
I know that rebuildCache = F prevents automatic file downloads. Does it also work the first time, when the cache has not been built yet?
My code:
What I get:
I think I have all the files for the "rebuild cache" step ready, but I still get the error reported above. What should I do next?
Please help me!
Ah, figured it out. What is meant to happen is that you download files from NCBI and then parse them and put the data in an omnibus SQLite database, which includes the date you downloaded the files. If you do that, and then later want to build another OrgDb package, the omnibus SQLite database is checked to see when it was built, and if that was a day or more in the past, it will re-download the data.
But the code that is used to populate the omnibus database was within an if statement that is triggered by the rebuildCache argument. If you say rebuildCache = FALSE, then the code to populate the omnibus database is skipped, and you then get the error you see. I have fixed this in both release and devel, which you can get by waiting for the package builder to build the package (e.g., in a day or two, running BiocManager::install() will get the updated version of AnnotationForge).
Hi,
I got it done! I'm really excited, thanks!
The process wasn't as easy as it seemed, and it still took me a full day to finish building the OrgDb package after updating to the new version you released. This is because files that appear to have finished downloading and are the correct size may not actually be complete, which shows how bad my network really is! Since the NCBI FTP site does not provide an MD5 checklist, I was never able to determine the integrity of the files, so the program would run with errors until I got the complete *.gz file.
Fortunately, this problem was eventually solved by using the wget CLI tool. If others have similar network problems, they can also refer to my experience.
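For anyone in the same situation, here is a rough sketch of how the manually downloaded files could be checked before building. It uses the file names listed earlier in this thread and assumes a `gzip` executable is on the PATH; `check_gz_files` is a hypothetical helper, not something AnnotationForge provides.

```r
# File names as listed earlier in this thread.
ncbi_files <- c("gene2pubmed.gz", "gene2accession.gz", "gene2refseq.gz",
                "gene_info.gz", "gene2go.gz", "idmapping_selected.tab.gz")

# Hypothetical helper: for each expected file, report whether it exists and
# whether `gzip -t` accepts it. Assumes `gzip` is available on the PATH.
check_gz_files <- function(dir, files = ncbi_files) {
  paths <- file.path(dir, files)
  exists <- file.exists(paths)
  intact <- vapply(seq_along(paths), function(i) {
    if (!exists[i]) return(FALSE)
    # gzip -t exits with status 0 only if the whole stream decompresses,
    # so a truncated download is reported as not intact.
    status <- system2("gzip", c("-t", shQuote(paths[i])),
                      stdout = FALSE, stderr = FALSE)
    identical(as.integer(status), 0L)
  }, logical(1))
  data.frame(file = files, exists = exists, intact = intact)
}

# Example: check_gz_files(getwd())
```

Any file reported with intact = FALSE would need to be downloaded again before calling makeOrgPackageFromNCBI with rebuildCache = FALSE.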
Thanks again! Staphylococcus aureus is not a rare species, and building this OrgDb may not be of great practical importance, but it really means a lot to me personally. I would not have been able to complete this work at all without your help. I look forward to interacting with you again next time.
Sorry, I'm back again.
After the software update, I downloaded the required .gz file via wget and used the "makeOrgPackageFromNCBI" function to create the OrgDb package offline. The process went very smoothly and no errors or bugs were reported.
However, I found two problems: 1. the created OrgDb package is missing content; 2. the GID and GeneID provided in the .gff file cannot be matched. This causes the GO analysis to fail.
Question 1: The created OrgDb package is missing content. Take E. coli as an example. Compared to the standard K-12 OrgDb package downloaded from the Bioconductor website, many "keytypes" are missing. Perhaps a self-built OrgDb package is expected to hold less information than the official one, so I don't know whether this is normal.
Question 2: Many columns in the package are "NA", such as "ENTREZID" and "GENENAME". Only the first row has data.
Question 3: The GeneID values from the .gff file do not match the GID column in the OrgDb I created. (They match the standard K-12 package perfectly.)
I have no idea what the problem is this time, thanks for your attention and reply!
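The mismatch described in Question 3 can be put into numbers with a short comparison. This is a sketch that assumes the GFF attributes column stores Entrez Gene IDs as "Dbxref=GeneID:<digits>" (the NCBI style); `gff_gene_ids` and `match_rate` are hypothetical helpers.

```r
# Hypothetical helper: pull Entrez Gene IDs out of GFF3 lines, assuming the
# attributes column carries them as "Dbxref=GeneID:<digits>" (NCBI style).
gff_gene_ids <- function(gff_lines) {
  gff_lines <- gff_lines[!startsWith(gff_lines, "#")]   # drop comment lines
  hits <- regmatches(gff_lines, regexpr("GeneID:[0-9]+", gff_lines))
  unique(sub("GeneID:", "", hits, fixed = TRUE))
}

# Hypothetical helper: fraction of GFF gene IDs found among the OrgDb keys.
match_rate <- function(gff_ids, orgdb_ids) {
  if (length(gff_ids) == 0) return(NA_real_)
  mean(gff_ids %in% as.character(orgdb_ids))
}

# Example, where "genome.gff" and `orgdb` are placeholders for your own
# GFF file and loaded OrgDb object:
# ids <- gff_gene_ids(readLines("genome.gff"))
# match_rate(ids, AnnotationDbi::keys(orgdb, keytype = "GID"))
```

A rate near 0 for the self-built package and near 1 for the official K-12 package would confirm that the GID column does not hold the expected gene IDs.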
The process for generating the 'real' OrgDb packages is quite complex, and cannot easily be replicated as part of a package, so what you get by building your own will necessarily be a subset of what you could get from us.
Searching the IDs in the package you built brings up lots of different species, and I cannot find many of those IDs in a gene2accession file that I just downloaded. So no idea what the problem is.
There are two existing E coli packages. And the K12 version perfectly matches your GFF. Why are you attempting to recreate something you can get already?
Hi,
Thank you for your response!
The pathogen I am studying is Staphylococcus aureus (taxid 1280), which cannot be downloaded from the Bioconductor website, and there are no annotations for this species in the AnnotationHub. The reason for using E. coli as an example for this question is that E. coli has the most data among bacteria and is more descriptive. If even E. coli is not working properly, it is not surprising that the OrgDb of other bacteria has a similar situation.
I found that, in the case of E. coli, the OrgDb generated in "offline" mode contains too little information and too many NA values to be usable. This should not be normal; could it be some kind of bug?
I downloaded all the .gz files again last night and rebuilt the OrgDb package for E.coli this morning, but nothing has changed. I think the problem is more consistent than occasional or random.
So I would like to ask you, if you have time, to try my method (tax_id = "562", rebuildCache = F) to see if there are also so many NA's that the analysis cannot continue?
I hope I've made myself clear, and thanks again for your attention and reply.
I don't know where all those other entries are coming from, and don't have the time right now to track it down. But there is essentially no information in the NCBI files for either E coli or S aureus, except for the gene_info file.
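To see for yourself how much data each NCBI gene file holds for a taxonomy ID, a rough sketch along these lines may help. It relies on the first tab-separated column of those files being the tax_id, so it does not apply to idmapping_selected.tab.gz; `count_taxid_rows` is a hypothetical helper.

```r
# Hypothetical helper: count data rows in a gzipped NCBI gene file whose first
# tab-separated column (tax_id) equals `tax_id`. Reads in chunks so that large
# files such as gene2accession.gz never have to fit in memory at once.
count_taxid_rows <- function(path, tax_id, chunk_size = 100000L) {
  con <- gzfile(path, "r")
  on.exit(close(con), add = TRUE)
  prefix <- paste0(tax_id, "\t")   # trailing tab avoids matching e.g. 5620
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    n <- n + sum(startsWith(lines, prefix))
  }
  n
}

# Example:
# files <- c("gene_info.gz", "gene2go.gz", "gene2pubmed.gz")
# sapply(files, count_taxid_rows, tax_id = "562")
```

Counts of zero everywhere except gene_info.gz would match the observation above.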
When you run makeOrgPackageFromNCBI, you first create a SQLite database that contains all the data, and then parse out the data for the taxonomic ID you are interested in. We can query that DB directly.
That's the only entry for E coli! I can't find any of those other GIDs that are populating your OrgDb, and I suspect it's a bug. But long story short, I don't believe you will be able to generate an OrgDb for bacteria using makeOrgPackageFromNCBI, because the data don't appear to exist in the files you can get from NCBI.
Thank you for your reply, I will find another way to solve this matter.
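For readers who want to repeat the direct query of the cache database mentioned above, here is a sketch using DBI and RSQLite. It is not the query that was actually run in this thread. It assumes the cache is a SQLite file in NCBIFilesDir (the file name NCBI.sqlite in the example is an assumption) and that the relevant tables have a tax_id column; `rows_per_table` is a hypothetical helper that discovers the tables instead of hard-coding their names.

```r
# Hypothetical helper: for every table in the cache database that has a
# tax_id column, count the rows belonging to one taxonomy ID.
# Requires the DBI and RSQLite packages to be installed.
rows_per_table <- function(db_path, tax_id) {
  con <- DBI::dbConnect(RSQLite::SQLite(), db_path)
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  tables <- DBI::dbListTables(con)
  has_taxid <- vapply(tables, function(tbl) {
    "tax_id" %in% DBI::dbListFields(con, tbl)
  }, logical(1))
  vapply(tables[has_taxid], function(tbl) {
    sql <- paste0("SELECT COUNT(*) AS n FROM ",
                  DBI::dbQuoteIdentifier(con, tbl), " WHERE tax_id = ?")
    as.numeric(DBI::dbGetQuery(con, sql, params = list(tax_id))$n)
  }, numeric(1))
}

# Example (the file name is an assumption, adjust to your cache file):
# rows_per_table(file.path(getwd(), "NCBI.sqlite"), "562")
```

The result is a named vector with one count per table, which makes it easy to see which tables contain anything for the organism.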