Error with makeOrgPackageFromNCBI
@dfb033f2
Last seen 9 months ago
Finland

I need to create an OrgDb package for my microorganism, but makeOrgPackageFromNCBI fails with the error shown below:

> makeOrgPackageFromNCBI(version = "0.1",
+                        author = "Cinzia Spagnoli cinzia.spagnoli@uniroma3.it",
+                        maintainer = "Cinzia Spagnoli cinzia.spagnoli@uniroma3.it",
+                        outputDir = ".",
+                        tax_id = "575584",
+                        genus = "Acinetobacter",
+                        species = "baumannii")
If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
Error in prepareDataFromNCBI(tax_id, NCBIFilesDir, outputDir, rebuildCache,  :
  no information found for species with tax id 575584
Bioconductor • 1.4k views
ADD COMMENT
@james-w-macdonald-5106
Last seen 9 hours ago
United States

You will rarely find a particular strain in any annotation data, and instead you should use the 'main' taxon ID, which for A. baumannii happens to be 470.

## how many genes for 470?
$ awk '$1 == 470' gene_info | wc -l
3733
## now how about 575584
$ awk '$1 == 575584' gene_info | wc -l
0

No idea how many genes one might expect for this bacterium, but you will get better results using 470.
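
For anyone who prefers to do the same check from R, here is a minimal sketch of the awk comparison above. It runs on a tiny made-up table rather than the real gene_info file; the three-column layout, the toy rows, and the helper name count_genes are assumptions for illustration only.

```r
# Toy stand-in for NCBI's gene_info file (the real file has many more columns).
toy_gene_info <- read.delim(
  text = paste(
    "tax_id\tGeneID\tSymbol",
    "470\t66395337\tdnaA",
    "470\t66395338\tdnaN",
    "9606\t1\tA1BG",
    sep = "\n"
  ),
  colClasses = "character"
)

# Hypothetical helper: number of gene rows annotated under a given taxon ID.
count_genes <- function(gene_info, tax_id) {
  sum(gene_info$tax_id == tax_id)
}

n_species <- count_genes(toy_gene_info, "470")     # species-level ID
n_strain  <- count_genes(toy_gene_info, "575584")  # strain-level ID
print(c(species = n_species, strain = n_strain))
```

A count of zero for a taxon ID means there is nothing for the package builder to extract, which is what the "no information found for species with tax id" error is reporting.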

ADD COMMENT

I tried, but it does not seem to work.

> library(AnnotationForge)
> makeOrgPackageFromNCBI(version = "0.1",
+                          author = "Cinzia Spagnoli cinzia.spagnoli@uniroma3.it",
+                          maintainer = "Cinzia Spagnoli cinzia.spagnoli@uniroma3.it",
+                          outputDir = ".",
+                          tax_id = "470",
+                          genus = "Acinetobacter",
+                          species = "baumannii")
If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
rebuilding the cache
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  error reading from the connection
In addition: Warning messages:
1: In .Internal(shortRowNames(x, type)) :
  closing unused connection 3 (D:/OneDrive - Universita degli Studi Roma Tre/Documenti/gene2pubmed.gz)
2: call dbDisconnect() when finished working with a connection 
3: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  invalid or incomplete compressed data
ADD REPLY

It might be due to either the spaces in your path, or the fact that it's a OneDrive directory. It's normally better to just use the Desktop and delete after installing.

> makeOrgPackageFromNCBI("0.0.1","me <me@mine.org>","me", tax_id = "470", genus = "Acinetobacter", species = "baumannii", rebuildCache = FALSE)
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
making the OrgDb package ...
Populating genes table:
genes table filled
Populating pubmed table:
pubmed table filled
Populating gene_info table:
gene_info table filled
Populating entrez_genes table:
entrez_genes table filled
Populating alias table:
alias table filled
Populating refseq table:
refseq table filled
Populating accessions table:
accessions table filled
Populating go table:
go table filled
table metadata filled

'select()' returned many:1
mapping between keys and columns
Dropping GO IDs that are too new for the current GO.db
Populating go table:
go table filled
Populating go_bp table:
go_bp table filled
Populating go_cc table:
go_cc table filled
Populating go_mf table:
go_mf table filled
'select()' returned many:1
mapping between keys and columns
Populating go_bp_all table:
go_bp_all table filled
Populating go_cc_all table:
go_cc_all table filled
Populating go_mf_all table:
go_mf_all table filled
Populating go_all table:
go_all table filled
Creating package in c:/Users/jmacdon/Desktop/org.Abaumannii.eg.db 
Now deleting temporary database file
complete!
[1] "org.Abaumannii.eg.sqlite"

> install.packages("org.Abaumannii.eg.db", type = "source", repos = NULL)
Installing package into 'C:/Users/jmacdon/AppData/Local/R/win-library/4.3'
(as 'lib' is unspecified)
* installing *source* package 'org.Abaumannii.eg.db' ...
<snip>
* DONE (org.Abaumannii.eg.db)
> library(org.Abaumannii.eg.db)

> select(org.Abaumannii.eg.db, head(keys(org.Abaumannii.eg.db)), "SYMBOL")
'select()' returned 1:1 mapping
between keys and columns
       GID        SYMBOL
1 66395337          dnaA
2 66395338          dnaN
3 66395339          recF
4 66395340          gyrB
5 66395341          cybC
6 66395342 F3P16_RS00030

> sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
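
If you want to check a directory before starting the (long) build, something like the following might help. It is only a sketch of the two suspicions raised above (spaces in the path, OneDrive sync folders); path_warnings is a hypothetical helper, not part of AnnotationForge, and whether either condition is the actual cause of the failure is not confirmed.

```r
# Hypothetical helper: flag the two path properties suspected above.
path_warnings <- function(path) {
  issues <- character(0)
  if (grepl(" ", path, fixed = TRUE)) {
    issues <- c(issues, "contains spaces")
  }
  if (grepl("OneDrive", path, ignore.case = TRUE)) {
    issues <- c(issues, "inside a OneDrive folder")
  }
  issues
}

# The directory from the failing run, and a plain local directory for comparison.
risky <- path_warnings("D:/OneDrive - Universita degli Studi Roma Tre/Documenti")
plain <- path_warnings("C:/build")
print(risky)
print(plain)

# With a clean directory, both the package output and the NCBI cache can be
# pointed at it (not run here, since it downloads several large files):
# makeOrgPackageFromNCBI(version = "0.0.1",
#                        author = "Some One <so@someplace.org>",
#                        maintainer = "Some One <so@someplace.org>",
#                        outputDir = "C:/build", NCBIFilesDir = "C:/build",
#                        tax_id = "470", genus = "Acinetobacter",
#                        species = "baumannii")
```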
ADD REPLY

I apologize for the delay in responding. However, the command still doesn't work for me. I would need to create the package from this genome: https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP058289.1/

ADD REPLY

I don't know what to tell you. I already told you that you can't build it for that strain, and you have to use 470 instead. I can get it to build (see above), and told you not to use a OneDrive path. Saying 'the command still doesn't work for me' without code or output isn't helpful at all (doesn't work how?).

ADD REPLY

Hello. After working on other projects, I recently came back to this code. I ran your code again, but it still fails, even though I manually downloaded these files into the working directory: gene2pubmed.gz, gene2accession.gz, gene2refseq.gz, gene_info.gz, and gene2go.gz.

Here is the code:

> makeOrgPackageFromNCBI("0.0.1","me <me@mine.org>","me",
+                        tax_id = "470",
+                        genus = "Acinetobacter",
+                        species = "baumannii")
If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
Error: no such table: gene2accession_date
ADD REPLY

That error indicates that you already have a file called NCBI.sqlite in your working directory, and it's incomplete (missing the gene2accession_date table). Here's mine:

> library(RSQLite)
Warning message:
package 'RSQLite' was built under R version 4.3.2 
> con <- dbConnect(SQLite(), "NCBI.sqlite")
> dbListTables(con)
 [1] "altGO"              
 [2] "altGO_date"         
 [3] "gene2accession"     
 [4] "gene2accession_date"
 [5] "gene2go"            
 [6] "gene2go_date"       
 [7] "gene2pubmed"        
 [8] "gene2pubmed_date"   
 [9] "gene2refseq"        
[10] "gene2refseq_date"   
[11] "gene_info"          
[12] "gene_info_date"     

## it's just a dumb little table that says when the db was built
> dbGetQuery(con, "select * from gene2accession_date;")
        date
1 2023-06-28

The easiest thing to do is to delete your NCBI.sqlite DB and then run makeOrgPackageFromNCBI again. But do note that you have to add rebuildCache = FALSE to your call, or you will download all those files again!
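
As a rough way to tell whether a cache file is in that half-built state before deleting it, one could compare its tables against the listing shown above. This is a sketch on a throwaway database created in a temp file, not on a real NCBI.sqlite; missing_cache_tables is a hypothetical helper, and the expected table names are simply copied from the listing above (leaving out the altGO tables).

```r
library(DBI)
library(RSQLite)

# Table names taken from the complete cache listing above (altGO tables omitted).
base_tables <- c("gene2accession", "gene2go", "gene2pubmed",
                 "gene2refseq", "gene_info")
expected_tables <- c(base_tables, paste0(base_tables, "_date"))

# Hypothetical helper: which expected tables are absent from a cache file?
missing_cache_tables <- function(db_path, expected) {
  con <- dbConnect(SQLite(), db_path)
  on.exit(dbDisconnect(con))
  setdiff(expected, dbListTables(con))
}

# Build a deliberately incomplete demo database to stand in for a broken cache.
demo_db <- tempfile(fileext = ".sqlite")
con <- dbConnect(SQLite(), demo_db)
dbWriteTable(con, "gene2pubmed",
             data.frame(tax_id = "470", gene_id = "1", pubmed_id = "2"))
dbWriteTable(con, "gene2pubmed_date", data.frame(date = "2023-06-28"))
dbDisconnect(con)

missing_tables <- missing_cache_tables(demo_db, expected_tables)
print(missing_tables)

# If anything is missing, remove the file so the next run starts from scratch.
if (length(missing_tables) > 0) {
  file.remove(demo_db)
}
```

On a real setup you would pass the path to your own NCBI.sqlite instead of demo_db.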

ADD REPLY

Thanks for the advice! I reran the command, and this is what I get. Fortunately, it downloaded most of the .gz files, but it failed at the GO step.

> makeOrgPackageFromNCBI(version = "0.0.1",
+                        author = "Cinzia Spagnoli cinzia.spagnoli@uniroma3.it",
+                        maintainer = "Cinzia Spagnoli cinzia.spagnoli@uniroma3.it",
+                        outputDir = '.',
+                        tax_id = "470",
+                        genus = "Acinetobacter",
+                        species = "baumannii",
+                        )
If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
rebuilding the cache
extracting data for our organism from : gene_info
getting data for gene2go.gz
rebuilding the cache
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Error in download.file(url, dest, quiet = TRUE) :
  download from 'https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz' failed
In addition: Warning messages:
1: In download.file(url, dest, quiet = TRUE) :
  downloaded length 0 != reported length 0
2: In download.file(url, dest, quiet = TRUE) :
  URL 'https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz': Timeout of 1000 seconds was reached

> con <- dbConnect(SQLite(), "NCBI.sqlite")
> dbListTables(con)
 [1] "gene2accession"      "gene2accession_date" "gene2go"             "gene2go_date"
 [5] "gene2pubmed"         "gene2pubmed_date"    "gene2refseq"         "gene2refseq_date"
 [9] "gene_info"           "gene_info_date"

ADD REPLY

Two remarks:

You can increase the timeout further, for example to 4000 seconds, with options(timeout = 4000).

Although unrelated to the timeout, note that you did not follow the required format for author and maintainer: the email address must be enclosed between < and >. See ?makeOrgPackageFromNCBI, which shows the format:

author = "Some One <so@someplace.org>",
maintainer = "Some One <so@someplace.org>",
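
Both remarks can be tried out in a plain R session before rerunning the build. The sketch below sets the timeout and checks a string against the "Name <email>" shape; is_valid_person and its regular expression are assumptions made for illustration, not the validation that AnnotationForge itself performs.

```r
# Raise the download timeout (in seconds) for the current R session.
options(timeout = 4000)
current_timeout <- getOption("timeout")
print(current_timeout)

# Hypothetical helper: does a string look like "Some One <so@someplace.org>"?
is_valid_person <- function(x) {
  grepl("^[^<>]+ <[^<>@ ]+@[^<>@ ]+>$", x)
}

good <- is_valid_person("Some One <so@someplace.org>")
bad  <- is_valid_person("Cinzia Spagnoli cinzia.spagnoli@uniroma3.it")
print(c(good = good, bad = bad))
```

Note that options(timeout = ...) only lasts for the current session, so it has to be set again after restarting R.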
ADD REPLY

I did it, after several attempts and with your invaluable advice! Now I'm trying to run GO and KEGG enrichment analysis. I also followed the advice given in this thread: "No genes can be mapped...." using enrichGO in clusterProfiler. But it does not work. Can you help me?

# GO classification
all_genes <- read.csv("all_genes.csv")
diff_genes <- read.csv("diff_genes.csv")

GO_analysis <- enrichGO(gene = diff_genes,
                        universe = all_genes,
                        OrgDb = org.Abaumannii.eg.db,
                        ont = "CC", # either "BP", "CC" or "MF",
                        pAdjustMethod = "none",
                        pvalueCutoff = 1,
                        qvalueCutoff = 1,
                        readable = TRUE,
                        pool = TRUE)
--> No gene can be mapped....
--> Expected input gene ID: 66397190,66398467,66396010,66395397,66398543,66398024
--> return NULL...
ADD REPLY

Happy to hear you got it working!

Please open a new thread for your new question. Before you do, double-check that the input you pass as gene (diff_genes, and likewise all_genes for universe) is a character vector, and that its values are indeed Entrez IDs (and thus match the reported Expected input gene IDs). In other words, first do all the checks that I suggested in the post you linked to!
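
A minimal sketch of those checks, runnable without the OrgDb package installed. The data frame stands in for what read.csv returns, the column name gene_id is an assumption about the CSV, and valid_ids stands in for keys(org.Abaumannii.eg.db), filled here with the expected IDs reported in the message above.

```r
# read.csv() returns a data frame, not a vector; this stands in for diff_genes.
diff_genes <- data.frame(gene_id = c(66397190L, 66398467L, 12345L))

# Stand-in for keys(org.Abaumannii.eg.db): the expected IDs from the message.
valid_ids <- c("66397190", "66398467", "66396010",
               "66395397", "66398543", "66398024")

# Check 1: the raw object is not a character vector.
input_is_character <- is.character(diff_genes)

# Check 2: pull out the ID column and convert it to character.
gene_vec <- as.character(diff_genes$gene_id)

# Check 3: how many of the IDs actually occur among the database keys?
is_mapped <- gene_vec %in% valid_ids
print(input_is_character)
print(table(is_mapped))
print(gene_vec[!is_mapped])  # IDs that would not be found
```

With the real package loaded, valid_ids would come from keys(org.Abaumannii.eg.db), and gene_vec (not the data frame) is what would be passed on as the gene argument.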

ADD REPLY
