Question

makeOrgPackageFromNCBI generates an empty database?

dissikratzl • 0

Hi Bioconductor community,

I am trying to build an OrgDb package for a "not-so" model organism that lacks an entry in AnnotationHub(). For this I used makeOrgPackageFromNCBI. Although it took a very long time, it successfully downloaded the following files:

[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz

It then created the organism package (with NAMESPACE, DESCRIPTION, zzz.R file, and SQLite file). I was able to load the package into R, but checking the database revealed a problem.

The command columns(org.Pputida.eg.db) returns what looks like a reasonable set of columns:

[1] "ALIAS"       "ENTREZID"    "EVIDENCE"    "EVIDENCEALL"
[5] "GENENAME"    "GID"         "GO"          "GOALL"
[9] "ONTOLOGY"    "ONTOLOGYALL" "SYMBOL"

While I can find the GO accession numbers in the database, every other column returns a single entry of "NaN". It seems to me that something went wrong while the database was being generated.

I hope someone can give a tip on how to proceed. Thanks a lot in advance.

Dissi Kratzl


library(AnnotationForge)

# Build an OrgDb package for Pseudomonas putida KT2440 (NCBI taxonomy ID 160488)
makeOrgPackageFromNCBI(version = "0.1",
                       author = "Kratzl_Dissi <Dissikratzl@gmail.de>",
                       maintainer = "Kratzl_Dissi <Dissikratzl@gmail.de>",
                       outputDir = ".",
                       NCBIFilesDir = ".",
                       tax_id = "160488",
                       genus = "Pseudomonas",
                       species = "putida")


 sessionInfo()
[1] LC_COLLATE=German_Germany.1252 
[2] LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats4    stats     graphics  grDevices utils    
[6] datasets  methods   base     

other attached packages:
 [1] org.Pputida.eg.db_0.1  AnnotationHub_3.2.2   
 [3] BiocFileCache_2.2.1    dbplyr_2.2.1          
 [5] clusterProfiler_4.2.2  devtools_2.4.5        
 [7] usethis_2.2.2          shiny_1.8.0           
 [9] biomaRt_2.50.3         GenomeInfoDb_1.30.1   
[11] AnnotationForge_1.36.0 AnnotationDbi_1.56.2  
[13] IRanges_2.28.0         S4Vectors_0.32.4      
[15] Biobase_2.54.0         BiocGenerics_0.40.0

OrganismDbi makeOrgPackageFromNCBI

updated 3 months ago by James W. MacDonald • written 3 months ago by dissikratzl

score 0 · Answer 1 · 2024-01-16

After you download all those files, the next step is to generate an omnibus SQLite database that is used to create the OrgDb package. This SQLite file is called NCBI.sqlite. If I query the one I have in hand, I get this:

> library(RSQLite)
> con <- dbConnect(SQLite(), "NCBI.sqlite")
> dbGetQuery(con, "select * from gene_info where tax_id='160488';")
  tax_id gene_id   symbol locus_tag
1 160488 2830333 NEWENTRY         -
  synonyms dbXrefs chromosome
1        -       -          -
  map_location
1            -
                                                                                                                            description
1 Record to support submission of GeneRIFs for a gene not in Gene (Pseudomonas putida (strain KT2440); Pseudomonas putida str. KT2440).
  gene_type nomenclature_symbol
1     other                   -
  nomenclature_name
1                 -
  nomenclature_status
1                   -
  other_designations
1                  -
  modification_date feature_type
1          20230125            -

This indicates that there is only a placeholder record for this particular strain (KT2440). But there are other strains that do have genes, such as Pseudomonas putida NBRC 14164:

> dbGetQuery(con, "select count(*) from gene_info where tax_id='1211579';")
  count(*)
1     5556
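To compare several candidate strains at once, the same query can be wrapped in a small helper. This is only a sketch: count_genes_by_taxid is a hypothetical name (not part of AnnotationForge), and it assumes the NCBI.sqlite file sits in the current working directory and has the gene_info table shown above.

```r
library(DBI)
library(RSQLite)

# Hypothetical helper (not an AnnotationForge function): count the rows in
# gene_info for each taxonomy ID, using a parameterised query.
count_genes_by_taxid <- function(con, tax_ids) {
  vapply(tax_ids, function(id) {
    res <- dbGetQuery(con,
                      "select count(*) as n from gene_info where tax_id = ?",
                      params = list(id))
    as.integer(res$n)
  }, integer(1))
}

# Assumption: NCBI.sqlite is in the current working directory.
if (file.exists("NCBI.sqlite")) {
  con <- dbConnect(SQLite(), "NCBI.sqlite")
  print(count_genes_by_taxid(con, c("160488", "1211579")))
  dbDisconnect(con)
}
```

A taxonomy ID that returns a count of 1 is worth inspecting by hand, since it may be a placeholder record like the one above.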

If that strain is close enough, you can make an OrgDb package by using its taxonomic ID instead. Do note that by default the whole download/create step is repeated if it has been more than 24 hours since you last ran the code. It's completely unnecessary to do that (and boring besides), so you should A) use the same working directory that has the existing NCBI.sqlite file in it, and B) include rebuildCache = FALSE in your call to makeOrgPackageFromNCBI. In that scenario you will just query the existing database to get the data you need, and it should not take as much time.
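Put together, the rebuild might look roughly like the sketch below. The author and maintainer strings are placeholders, and the guard around the call is my own addition so that nothing runs unless AnnotationForge is installed and NCBI.sqlite is already present in NCBIFilesDir.

```r
# Arguments for rebuilding from the cached NCBI.sqlite file.
# tax_id 1211579 is the NBRC 14164 strain queried above; the author and
# maintainer values are placeholders to be replaced with your own.
build_args <- list(
  version      = "0.1",
  author       = "Your Name <you@example.org>",
  maintainer   = "Your Name <you@example.org>",
  outputDir    = ".",
  NCBIFilesDir = ".",        # directory that already holds NCBI.sqlite
  tax_id       = "1211579",
  genus        = "Pseudomonas",
  species      = "putida",
  rebuildCache = FALSE       # reuse the existing cache, skip the download
)

# Only run the build when the package and the cached database are available.
cache_file <- file.path(build_args$NCBIFilesDir, "NCBI.sqlite")
if (requireNamespace("AnnotationForge", quietly = TRUE) && file.exists(cache_file)) {
  do.call(AnnotationForge::makeOrgPackageFromNCBI, build_args)
}
```

Because genus and species are unchanged, the new package would presumably get the same name as the first attempt (org.Pputida.eg.db), so the earlier output directory may need to be moved aside first. That is an assumption worth checking before running the build.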