Question

AnnotationForge::makeOrgPackageFromNCBI - ERROR to access url (NCBI FTP site)

Mistinrain ▴ 10

@d97f3ccd

Last seen 13 months ago

United States

Hi, I am trying to build an OrgDb package from the NCBI database. However, I get a URL access error when running AnnotationForge::makeOrgPackageFromNCBI.

My code is:

> makeOrgPackageFromNCBI(version = "0.1",
+                        author = "Chang Liu <liuchangbio@163.com>",
+                        maintainer = "Chang Liu <liuchangbio@163.com>",
+                        outputDir = ".",
+                        NCBIFilesDir = getwd(),
+                        tax_id = "703339", # Staphylococcus aureus
+                        genus = "Staphylococcus",
+                        species = "aureus")

Here is the ERROR report:

If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
Error in .tryDL(url, tmp) : url access failed after
4
attempts; url:
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz

AnnotationForge • 2.1k views

ADD COMMENT • link 14 months ago • updated 13 months ago Mistinrain ▴ 10

score 0 · Answer 1 · 2023-01-31

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 15 hours ago

United States

Try setting options(timeout = 5000)

ADD COMMENT • link 14 months ago James W. MacDonald 65k

Hi James,

Thank you very much for your response. I did as you suggested, but it still fails with the same error.

To test this, I also tried the code from the official documentation, but it still cannot connect.

I am confused by this problem. My network is fine and "makeOrgPackageFromNCBI" is the officially recommended method, so this error is hard to understand.

Here is my code:

> AnnotationForge::makeOrgPackageFromNCBI(version = "0.1",
+                        author = "Chang Liu <liuchangbio@163.com>",
+                        maintainer = "Chang Liu <liuchangbio@163.com>",
+                        outputDir = ".",
+                        NCBIFilesDir = getwd(),
+                        tax_id = "703339", 
+                        genus = "Staphylococcus",
+                        species = "aureus",
+                        options(timeout = 5000))

If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
Error in .tryDL(url, tmp) : url access failed after
4
attempts; url:
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz

Thank you for your attention :)

ADD REPLY • link 14 months ago Mistinrain ▴ 10

AFAIK you should set options(timeout = 5000) before you call the function makeOrgPackageFromNCBI, and not pass it as an argument within makeOrgPackageFromNCBI...

The default maximum request time is 60 seconds (shown here on Windows), and setting the option overrides this value.

> ## check default setting
> getOption('timeout')
[1] 60
> ## increase value
> options(timeout = 5000)
> getOption('timeout')
[1] 5000
>
> ## now run makeOrgPackageFromNCBI()
> makeOrgPackageFromNCBI(<<with all arguments you used in your first post>>)
>
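
If the longer timeout should apply to a single call only, a small wrapper can set the option and restore the previous value afterwards. This is a minimal sketch; `with_timeout` is a hypothetical helper name, not part of AnnotationForge or base R.

```r
# Evaluate an expression with a temporary download timeout (in seconds).
# `expr` is evaluated lazily, so it runs after the option has been set.
with_timeout <- function(seconds, expr) {
  old <- options(timeout = seconds)   # options() returns the previous values
  on.exit(options(old), add = TRUE)   # restore them on exit, even after an error
  expr
}

# Quick check that the option is visible inside the wrapper
with_timeout(5000, getOption("timeout"))

# Intended use (commented out because it needs network access):
# with_timeout(5000, AnnotationForge::makeOrgPackageFromNCBI(
#   <<with all arguments you used in your first post>>))
```

After the call returns, getOption("timeout") is back at whatever it was before.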

ADD REPLY • link 14 months ago Guido Hooiveld ★ 3.9k

Hi,

Thank you for your reply.

I tried to follow your advice, but it still does not work. This does not look like an easy error to fix.

I searched the Bioconductor forum and no one seems to have asked the same question.

The same question was asked on GitHub, but it was not solved there either.

https://github.com/Bioconductor/AnnotationForge/issues/40

Anyway, thank you for your advice. Thank you very much!

> getOption('timeout')
[1] 5000
> options(timeout = 5000)
> getOption('timeout')
[1] 5000

> AnnotationForge::makeOrgPackageFromNCBI(version = "0.1",
+                        author = "Some One <so@someplace.org>",
+                        maintainer = "Some One <so@someplace.org>",
+                        outputDir = ".",
+                        tax_id = "59729",
+                        genus = "Taeniopygia",
+                        species = "guttata",
+                        NCBIFilesDir = getwd()
+                        )


If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
Error in .tryDL(url, tmp) : url access failed after
4
attempts; url:
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz

ADD REPLY • link 14 months ago Mistinrain ▴ 10

Does this work?

> tmp <- tempfile("whatevs")
> download.file("ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz", tmp, mode = "wb")
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz'
Content type 'unknown' length 64119282 bytes (61.1 MB)
> file.exists(tmp)
[1] TRUE

It looks like you might be blocked by a firewall.
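
To turn this check into a yes/no answer, the download attempt can be wrapped in tryCatch. This is a minimal sketch; `can_download` is a hypothetical helper name, and a FALSE result only shows that R could not fetch the URL, not why.

```r
# Return TRUE if R can fetch `url` completely, FALSE on any error or warning.
can_download <- function(url, timeout = 5000) {
  old <- options(timeout = timeout)
  on.exit(options(old), add = TRUE)      # restore the previous timeout
  dest <- tempfile()
  on.exit(unlink(dest), add = TRUE)      # remove the temporary copy
  ok <- tryCatch(
    download.file(url, dest, mode = "wb", quiet = TRUE) == 0,
    error = function(e) FALSE,
    warning = function(w) FALSE          # e.g. a truncated or timed-out download
  )
  ok && file.exists(dest) && file.size(dest) > 0
}

# Intended use (downloads the whole file, so it needs network access and time):
# can_download("ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz")
```

If this returns FALSE while the same URL opens in a browser, the problem is likely in how R reaches the network (firewall, proxy, or timeout) rather than on the server.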

ADD REPLY • link 14 months ago James W. MacDonald 65k

Your answer captures the essence of the problem.

At first it did not work, indicating that there was a problem with my network. After I improved my network conditions, the download succeeded, but I have to admit that it is very slow and inconsistent.

> download.file("ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz", tmp, mode = "wb")
  trying URL 'ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz'
  Content type 'unknown' length 66163513 bytes (63.1 MB)
  ==================================================
> file.exists(tmp)
  [1] TRUE

And naturally, several subsequent steps were successfully performed, but eventually it stopped at a new problem.

This is my code:

> AnnotationForge::makeOrgPackageFromNCBI(version = "0.1",
+                        author = "Chang Liu <liuchangbio@163.com>",
+                        maintainer = "Chang Liu <liuchangbio@163.com>",
+                        outputDir = ".",
+                        NCBIFilesDir = getwd(),
+                        tax_id = "1280", # Staphylococcus aureus
+                        genus = "Staphylococcus",
+                        species = "aureus"
+                        )

What I get:

If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Error in download.file(url, dest, quiet = TRUE) : 
  download from 'https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz' failed
In addition: Warning messages:
1: In download.file(url, dest, quiet = TRUE) :
  downloaded length 35263956 != reported length 11157848349
2: In download.file(url, dest, quiet = TRUE) :
  URL 'https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz': Timeout of 1000 seconds was reached

To rule out an occasional network glitch, I have tried many times on different networks, but I have not found a solution yet. I'll keep trying until it's solved because I really want to use the "makeOrgPackageFromNCBI" feature!

Thank you for answering my question!

ADD REPLY • link 14 months ago Mistinrain ▴ 10

You are having the same problem with the data from expasy.org. You only get a fraction of the file before you hit the timeout, which as you can see is only 1000 seconds. You need to bump that up by (probably) a factor of 10.
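
As a rough check on how large the timeout needs to be, the numbers in the warning messages can be extrapolated. This is a sketch that assumes the transfer rate stays constant, which it may well not; `estimate_download_seconds` is a hypothetical helper name.

```r
# Estimate the total time a download needs from the progress made so far,
# assuming a constant transfer rate.
estimate_download_seconds <- function(bytes_so_far, seconds_so_far, total_bytes) {
  rate <- bytes_so_far / seconds_so_far   # bytes per second
  total_bytes / rate
}

# Values taken from the two warning messages above:
# 35263956 bytes arrived before the 1000 second timeout, out of 11157848349
needed <- estimate_download_seconds(35263956, 1000, 11157848349)
needed          # seconds
needed / 3600   # hours

# Set the timeout with some headroom before retrying
options(timeout = ceiling(needed * 1.5))
```

At the observed rate this works out to roughly 88 hours, so on a connection this slow even a tenfold increase may not be enough, and fetching the file outside R may be more practical.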

ADD REPLY • link 14 months ago James W. MacDonald 65k

Hi,

I am now very clear that my problem is a network problem.

What bothers me is that I was actually able to download all the files (gene2pubmed.gz, gene2accession.gz, gene2refseq.gz, gene_info.gz, gene2go.gz and idmapping_selected.tab.gz) individually using my browser.

The only thing preventing me from succeeding is that I cannot download them automatically from within RStudio.

So, since I can download all the files myself, is there an option that tells the program to just use the files in a specific folder?

I know that rebuildCache = F can prevent automatic file downloads. Does this also work the first time, when the cache has not been built yet?

My code:

> AnnotationForge::makeOrgPackageFromNCBI(version = "0.1",
+                        author = "Chang Liu <liuchangbio@163.com>",
+                        maintainer = "Chang Liu <liuchangbio@163.com>",
+                        outputDir = ".",
+                        NCBIFilesDir = "/Volumes/ROG2T/2023-02_GO_DATA/",
+                        tax_id = "1280", # Staphylococcus aureus
+                        genus = "Staphylococcus",
+                        species = "aureus",
+                        rebuildCache = F
+                        )

What I get:

preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
Error: no such table: main.gene2pubmed

I think I have all the files needed for the "rebuild cache" step ready, but I still get the error reported above. What should I do next?

Please help me!

ADD REPLY • link 14 months ago Mistinrain ▴ 10

Ah, figured it out. What is meant to happen is that you download files from NCBI and then parse them and put the data in an omnibus SQLite database, which includes the date you downloaded the files. If you do that, and then later want to build another OrgDb package, the omnibus SQLite database is checked to see when it was built, and if that was a day or more in the past, it will re-download the data.

But the code that is used to populate the omnibus database was within an if statement that is triggered by the rebuildCache argument. If you say rebuildCache = FALSE, then the code to populate the omnibus database is skipped, and you then get the error you see. I have fixed this in both release and devel, which you can get by waiting for the package builder to build the package (e.g, in a day or two, running BiocManager::install() will get the updated version of AnnotationForge).
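
The intended logic, and the way the bug short-circuited it, can be sketched roughly as follows. This is an illustrative simplification and not the actual AnnotationForge source; all function names are hypothetical, and the "fixed" variant is only one plausible shape of the fix.

```r
# TRUE when the cache was built a day or more ago
cache_is_stale <- function(built, now = Sys.time()) {
  as.numeric(difftime(now, built, units = "days")) >= 1
}

# Buggy shape: the cache is populated only inside the rebuildCache branch,
# so rebuildCache = FALSE skips population even when no cache exists yet.
needs_populating_buggy <- function(cache_exists, built, rebuildCache) {
  rebuildCache && (!cache_exists || cache_is_stale(built))
}

# Assumed fixed shape: a missing cache is always populated, and rebuildCache
# only controls whether an existing, stale cache is refreshed.
needs_populating_fixed <- function(cache_exists, built, rebuildCache) {
  !cache_exists || (rebuildCache && cache_is_stale(built))
}

# First run with rebuildCache = FALSE and no cache on disk
needs_populating_buggy(cache_exists = FALSE, built = NA, rebuildCache = FALSE)
needs_populating_fixed(cache_exists = FALSE, built = NA, rebuildCache = FALSE)
```

With the buggy shape nothing is ever written on a first run, which matches the "no such table: main.gene2pubmed" error reported above.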

ADD REPLY • link 14 months ago James W. MacDonald 65k

Hi,

I got it done! Really excited, thanks!

The process wasn't as easy as it seemed, and it still took me a full day to finish building the OrgDb package after updating to the new version you released. This is because files that appear to have finished downloading and are the correct size may not actually be complete, which shows how bad my network really is! Since the NCBI FTP site does not provide an MD5 checklist, I was never able to determine the integrity of the files, so the program would run with errors until I got the complete *.gz file.

Fortunately, this problem was eventually solved by using the wget CLI tool. If others have similar network problems, they can also refer to my experience.
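
For anyone who prefers to stay inside R, something along these lines should be roughly equivalent. This is a sketch under assumptions: it requires the wget and gzip command-line tools to be installed and on the PATH, and the helper names `fetch_with_wget` and `gz_is_intact` are hypothetical.

```r
# Download through wget from within R. "--tries=10" lets wget retry
# (and pick up where it left off) when the connection drops.
fetch_with_wget <- function(url, destfile) {
  download.file(url, destfile, method = "wget", extra = "--tries=10")
}

# Check that a .gz file decompresses cleanly. "gzip -t" exits with a
# non-zero status for truncated or corrupt archives.
gz_is_intact <- function(path) {
  status <- system2("gzip", c("-t", shQuote(path)), stdout = FALSE, stderr = FALSE)
  identical(as.integer(status), 0L)
}

# Intended use (commented out because it needs network access):
# fetch_with_wget("ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz", "gene2pubmed.gz")
# gz_is_intact("gene2pubmed.gz")
```

The integrity test does not replace a checksum, but it does catch the case described above, where a file has the expected size on disk yet cannot be decompressed to the end.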

Thanks again! Staphylococcus aureus is not a rare species, and building this OrgDb may not be of great practical importance, but it really means a lot to me personally. I would not have been able to complete this work at all without your help. I look forward to interacting with you again next time.

preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Please be patient while we work out which organisms can be annotated
  with ensembl IDs.
making the OrgDb package ...
Populating genes table:
genes table filled
Populating gene_info table:
gene_info table filled
Populating entrez_genes table:
entrez_genes table filled
Populating alias table:
alias table filled
Populating go table:
go table filled
table metadata filled

'select()' returned many:1 mapping between keys and columns
Dropping GO IDs that are too new for the current GO.db
Populating go table:
go table filled
Populating go_bp table:
go_bp table filled
Populating go_cc table:
go_cc table filled
Populating go_mf table:
go_mf table filled
'select()' returned many:1 mapping between keys and columns
Populating go_bp_all table:
go_bp_all table filled
Populating go_cc_all table:
go_cc_all table filled
Populating go_mf_all table:
go_mf_all table filled
Populating go_all table:
go_all table filled
Creating package in ./org.Saureus.eg.db 
Now deleting temporary database file
complete!
[1] "org.Saureus.eg.sqlite"

ADD REPLY • link 14 months ago Mistinrain ▴ 10

Sorry, I'm back again.

After the software update, I downloaded the required .gz file via wget and used the "makeOrgPackageFromNCBI" function to create the OrgDb package offline. The process went very smoothly and no errors or bugs were reported.

However, I found some problems: 1. the created OrgDb package is missing content; 2. the GID does not match the GeneID provided in the .gff file. This causes the GO analysis to fail.

Question 1: The created OrgDb package is missing content. Take E. coli as an example. Compared to the standard K-12 OrgDb package downloaded from the Bioconductor website, many keytypes are missing. It may be that a package we generate ourselves simply has less information than the official package, so I don't know whether this is normal.

> keytypes(org.EcK12.eg.db)
 [1] "ACCNUM"      "ALIAS"       "ENTREZID"    "ENZYME"      "EVIDENCE"    "EVIDENCEALL"
 [7] "GENENAME"    "GO"          "GOALL"       "ONTOLOGY"    "ONTOLOGYALL" "PATH"       
[13] "PMID"        "REFSEQ"      "SYMBOL"     
> keytypes(org.Ecoli.eg.db)
 [1] "ALIAS"       "ENTREZID"    "EVIDENCE"    "EVIDENCEALL" "GENENAME"    "GID"        
 [7] "GO"          "GOALL"       "ONTOLOGY"    "ONTOLOGYALL" "SYMBOL"

Question 2: Many columns in the package are "NA", such as "ENTREZID" and "GENENAME". Only the first row has data.

K_12_OrgDb_ID <- AnnotationDbi::select(org.EcK12.eg.db, 
                                      keys = keys(org.EcK12.eg.db),
                                      columns = c('GENENAME','ENTREZID','GO'))

Ecoli_OrgDb_myself <- AnnotationDbi::select(org.Ecoli.eg.db, 
                                            keys = keys(org.Ecoli.eg.db),
                                            columns = c('GID','ENTREZID','GO', 'GENENAME'))
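
To quantify how much of such a table is empty, the share of NA values per column can be computed. This is a minimal sketch on made-up toy data; `na_fraction` is a hypothetical helper name, and the toy values are not taken from either package.

```r
# Fraction of missing values in each column of a data frame
na_fraction <- function(df) {
  vapply(df, function(col) mean(is.na(col)), numeric(1))
}

# Toy data standing in for a select() result (values are made up)
toy <- data.frame(
  GID      = c("1", "2", "3", "4"),
  ENTREZID = c("1", NA, NA, NA),
  GENENAME = c("a gene", NA, NA, NA)
)
na_fraction(toy)
```

Applied to the real result, this would be na_fraction(Ecoli_OrgDb_myself).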

Question 3: GeneID from the .gff file does not match the GID column in the OrgDb I created. (But it can match with standard K-12 package perfectly)

library(dplyr)  # provides %>% and pull()

K_12.gff <- read.csv(file = "Questino/Ecoli_geneID.txt", header = TRUE) %>% 
                    pull(K.12_gff_Gene_ID) %>% sort()

K_12_OrgDb_ID <- AnnotationDbi::select(org.EcK12.eg.db, 
                                      keys = keys(org.EcK12.eg.db),
                                      columns = c('GENENAME','ENTREZID','GO')) %>% 
                                      pull(ENTREZID) %>% sort()

Ecoli_OrgDb_myself <- AnnotationDbi::select(org.Ecoli.eg.db, 
                                            keys = keys(org.Ecoli.eg.db),
                                            columns = c('GID','ENTREZID','GO', 'GENENAME')) %>% 
                                            pull(GID) %>% sort()




> sum(K_12.gff %in% K_12_OrgDb_ID)
[1] 4167
> sum(K_12.gff %in% Ecoli_OrgDb_myself)
[1] 0
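
A slightly more defensive comparison coerces both ID sets to character first, so that a type mismatch (integer IDs against character IDs) cannot be the reason for zero matches. This is a sketch on made-up IDs; `id_overlap` is a hypothetical helper name.

```r
# Summarise how many query IDs are found in a reference set.
# Both sides are compared as unique character strings.
id_overlap <- function(query, reference) {
  query <- unique(as.character(query))
  reference <- unique(as.character(reference))
  c(n_query     = length(query),
    n_matched   = sum(query %in% reference),
    n_unmatched = length(setdiff(query, reference)))
}

# Made-up example: integer IDs on one side, character IDs on the other
id_overlap(c(944742L, 944743L, 944744L), c("944742", "944744", "100"))
```

Applied to the vectors above, this would be id_overlap(K_12.gff, Ecoli_OrgDb_myself).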

I have no idea what the problem is this time, thanks for your attention and reply!

ADD REPLY • link 13 months ago Mistinrain ▴ 10

The process for generating the 'real' OrgDb packages is quite complex, and cannot easily be replicated as part of a package, so what you get by building your own will necessarily be a subset of what you could get from us.

Searching the IDs in the package you built brings up lots of different species, and I cannot find many of those IDs in a gene2accession file that I just downloaded. So no idea what the problem is.

There are two existing E coli packages. And the K12 version perfectly matches your GFF. Why are you attempting to recreate something you can get already?

ADD REPLY • link 13 months ago James W. MacDonald 65k

Hi,

Thank you for your response!

The pathogen I am studying is Staphylococcus aureus (taxid 1280), for which no OrgDb package can be downloaded from the Bioconductor website, and there are no annotations for this species in AnnotationHub. I used E. coli as the example for this question because E. coli has the most data among bacteria and is therefore the most illustrative. If even E. coli does not work properly, it is not surprising that the OrgDb for other bacteria has similar problems.

I found that in the case of E. coli, the OrgDb generated in "offline" mode had too little information and too many NA values to be usable. This should not be normal. Could it be some kind of bug?

I downloaded all the .gz files again last night and rebuilt the OrgDb package for E. coli this morning, but nothing has changed, so the problem appears to be consistent rather than occasional or random.

So I would like to ask you, if you have time, to try my method (tax_id = "562", rebuildCache = F) to see whether there are also so many NAs that the analysis cannot continue.

I hope I've made myself clear, and thanks again for your attention and reply.

ADD REPLY • link 13 months ago Mistinrain ▴ 10

I don't know where all those other entries are coming from, and don't have the time right now to track it down. But there is essentially no information in the NCBI files for either E coli or S aureus, except for the gene_info file.

When you run makeOrgPackageFromNCBI, you first create a SQLite database that contains all the data, and then parse out the data for the taxonomic ID you are interested in. We can query that DB directly.

> library(DBI)
> library(RSQLite)
> con <- dbConnect(SQLite(), "NCBI.sqlite")
> dbGetQuery(con, "select * from gene_info where tax_id='562' limit 10;")
  tax_id gene_id   symbol locus_tag synonyms dbXrefs chromosome map_location
1    562 2827929 NEWENTRY         -        -       -          -            -
                                                                                                                                                                                                                                                                                   description
1 Record to support submission of GeneRIFs for a gene not in Gene (Bacillus coli; Bacterium coli; Bacterium coli commune; E. coli; Enterococcus coli; Escherichia/Shigella coli.  Use when strain, subtype, isolate, etc. is unspecified, or when different from all specified ones in Gene.).
  gene_type nomenclature_symbol nomenclature_name nomenclature_status
1     other                   -                 -                   -
  other_designations modification_date feature_type
1                  -          20230204            -

That's the only entry for E coli! I can't find any of those other GIDs that are populating your OrgDb, and I suspect it's a bug. But long story short, I don't believe you will be able to generate an OrgDb for bacteria using makeOrgPackageFromNCBI, because the data don't appear to exist in the files you can get from NCBI.
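
Before attempting a build for another organism, the same kind of query can be used to count how many gene_info rows exist for its taxonomy ID. This is a sketch that runs against a small in-memory table with made-up rows rather than the real NCBI.sqlite cache; `count_genes` is a hypothetical helper name, and only the tax_id and gene_id columns of the real table are mimicked.

```r
library(DBI)

# In-memory stand-in for the cache database (rows are made up for illustration)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "gene_info", data.frame(
  tax_id  = c("562", "9606", "9606"),
  gene_id = c("1", "2", "3"),
  stringsAsFactors = FALSE
))

# Count gene_info rows for one taxonomy ID, using a bound parameter
count_genes <- function(con, tax_id) {
  dbGetQuery(con,
             "SELECT COUNT(*) AS n FROM gene_info WHERE tax_id = ?",
             params = list(as.character(tax_id)))$n
}

n_ecoli <- count_genes(con, 562)
n_human <- count_genes(con, 9606)
dbDisconnect(con)
```

Against the real cache, the connection would instead be opened with dbConnect(RSQLite::SQLite(), "NCBI.sqlite"), and a count of one or zero would suggest that there is little to build an OrgDb from.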