Question

Error for AnnotationForge makeOrgPackageFromNCBI function

Gayatri • 0

@1e961e20

India

Hello everyone,

I am a PhD student running GSEA on Candida albicans data. I queried AnnotationHub for an existing OrgDb and found none, so I am trying to build an organism database for C. albicans myself. I have gone through the threads "Problem making orgdb package for bacteria (Pseudomonas) using annotation hub and annotation forge", "error with downloading NCBI data for makeorgpackagefromNCBI?", and "AnnotationForge not working for building custom org packages", but I am still encountering the error below. After deleting the NCBI.sqlite file, I tried downloading the files directly from https://ftp.ncbi.nlm.nih.gov/gene/DATA/, but to no avail. I also tried raising the timeout option to 10000. Any help would be highly appreciated.


> hub <- AnnotationHub()
  |=====================================================================================| 100%

snapshotDate(): 2023-10-23
> query(hub, c("OrgDb","Candida albicans"))
AnnotationHub with 0 records
# snapshotDate(): 2023-10-23

> getOption('timeout')
[1] 60
> options(timeout = 10000)
> getOption('timeout')
[1] 10000

> makeOrgPackageFromNCBI("0.1", 
+                        "Gayatri <gayatri@catg.edu.in>", 
+                        "Gayatri", 
+                        ".", 
+                        "237561", 
+                        "Candida", 
+                        "albicans", 
+                        rebuildCache = FALSE)
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
Error: no such table: main.gene2pubmed

> list.files()
[1] "gene_info.gz"      "gene2accession.gz" "gene2go.gz"        "gene2pubmed.gz"   
[5] "gene2refseq.gz" 

> sessionInfo()
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Asia/Calcutta
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] AnnotationForge_1.44.0 biomaRt_2.58.2         AnnotationHub_3.10.1  
 [4] BiocFileCache_2.10.2   dbplyr_2.5.0           GenomeInfoDb_1.38.8   
 [7] AnnotationDbi_1.64.1   IRanges_2.36.0         S4Vectors_0.40.2      
[10] Biobase_2.62.0         BiocGenerics_0.48.1   

loaded via a namespace (and not attached):
 [1] KEGGREST_1.42.0               vctrs_0.6.5                   tools_4.3.3                  
 [4] bitops_1.0-7                  generics_0.1.3                curl_5.2.1                   
 [7] tibble_3.2.1                  fansi_1.0.6                   RSQLite_2.3.6                
[10] blob_1.2.4                    pkgconfig_2.0.3               lifecycle_1.0.4              
[13] GenomeInfoDbData_1.2.11       compiler_4.3.3                stringr_1.5.1                
[16] Biostrings_2.70.3             progress_1.2.3                httpuv_1.6.15                
[19] htmltools_0.5.8.1             RCurl_1.98-1.14               yaml_2.3.8                   
[22] interactiveDisplayBase_1.40.0 pillar_1.9.0                  later_1.3.2                  
[25] crayon_1.5.2                  cachem_1.0.8                  mime_0.12                    
[28] tidyselect_1.2.1              digest_0.6.35                 stringi_1.8.3                
[31] dplyr_1.1.4                   BiocVersion_3.18.1            fastmap_1.1.1                
[34] cli_3.6.2                     magrittr_2.0.3                XML_3.99-0.16.1              
[37] utf8_1.2.4                    prettyunits_1.2.0             filelock_1.0.3               
[40] promises_1.3.0                rappdirs_0.3.3                bit64_4.0.5                  
[43] XVector_0.42.0                httr_1.4.7                    bit_4.0.5                    
[46] png_0.1-8                     hms_1.1.3                     memoise_2.0.1                
[49] shiny_1.8.1.1                 rlang_1.1.3                   Rcpp_1.0.12                  
[52] xtable_1.8-4                  glue_1.7.0                    DBI_1.2.2                    
[55] xml2_1.3.6                    BiocManager_1.30.22           R6_2.5.1                     
[58] zlibbioc_1.48.2              
Warning message:
call dbDisconnect() when finished working with a connection

OrgDb AnnotationForge • 288 views

ADD COMMENT • link updated 9 days ago by James W. MacDonald 65k • written 12 days ago by Gayatri • 0

score 0 · Answer 1 · 2024-04-22

James W. MacDonald 65k

@james-w-macdonald-5106

United States

That error indicates that your NCBI.sqlite database is missing the gene2pubmed table, so you should regenerate that db. Just delete it and then rerun the script as you already have.
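
For anyone following along, the "delete and rerun" step can be sketched in R roughly as below. This assumes the cache file is called NCBI.sqlite and sits in the current working directory, as in this thread. `remove_stale_cache` is a hypothetical helper written for illustration, not an AnnotationForge function.

```r
# Hypothetical helper (not part of AnnotationForge): delete a cached
# database file if it exists, so the next run has to rebuild it.
remove_stale_cache <- function(db_path = "NCBI.sqlite") {
  if (!file.exists(db_path)) {
    message("No cache found at ", db_path)
    return(invisible(FALSE))
  }
  # Report the size first; a 0-byte file suggests an earlier run
  # stopped before any table was written.
  message("Removing ", db_path, " (", file.size(db_path), " bytes)")
  invisible(file.remove(db_path))
}

remove_stale_cache("NCBI.sqlite")
```

After that, the makeOrgPackageFromNCBI() call from the question would be re-run unchanged.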

ADD COMMENT • link 11 days ago James W. MacDonald 65k

Hello James,

How do I regenerate the db? By running the makeOrgPackageFromNCBI() command? I have tried that, first by deleting just the NCBI.sqlite file and running the command, and then by deleting both NCBI.sqlite and the gene2pubmed file, since that file is around 180 MB. But I still get an error saying that the gene2accession file was only partially transferred, because it too gets re-downloaded even though it had already been downloaded before I ran the command.

ADD REPLY • link 10 days ago Gayatri • 0

I don't really follow what you are saying. All you have to do is delete the NCBI.sqlite db and re-run the script exactly as you did above. If you say rebuildCache = FALSE you shouldn't download anything. And the error you got before didn't say anything about downloading files. It said that you were missing the gene2pubmed table.

ADD REPLY • link 10 days ago James W. MacDonald 65k

What I meant is that even after deleting the NCBI.sqlite file and re-running the script, an empty NCBI.sqlite file (0 KB) is created, which is causing the error I mentioned:

preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
Error: no such table: main.gene2pubmed

I made sure any partially created NCBI.sqlite files were deleted, then downloaded the data directly from NCBI, and only then re-ran the script. But the creation of this empty NCBI.sqlite file is causing the script to terminate. What do you suggest I do about this?
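
One way to confirm what state the cache is in before re-running would be something like the sketch below. It assumes the DBI and RSQLite packages are installed, and `inspect_cache` is a hypothetical helper name, not something AnnotationForge provides. It reports the file size and the tables SQLite can see, and closes the connection on exit.

```r
library(DBI)      # assumption: DBI and RSQLite are installed
library(RSQLite)

# Hypothetical helper: report the file size and the tables present in a
# cached SQLite file. Checks file.exists() first, because connecting to
# a missing path would create a new empty file.
inspect_cache <- function(db_path = "NCBI.sqlite") {
  if (!file.exists(db_path)) {
    return(list(exists = FALSE, bytes = NA_real_, tables = character(0)))
  }
  con <- dbConnect(SQLite(), db_path)
  on.exit(dbDisconnect(con), add = TRUE)  # always close the connection
  list(exists = TRUE,
       bytes  = file.size(db_path),
       tables = dbListTables(con))
}

str(inspect_cache("NCBI.sqlite"))
```

If the table list does not include gene2pubmed, that would be consistent with the "no such table: main.gene2pubmed" error above.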

ADD REPLY • link 9 days ago Gayatri • 0

That's weird. I don't have any problem at all generating the OrgDb on my box. How big is the gene2pubmed.gz file? I get this:

gzip -dc gene2pubmed.gz | wc -l
57054066

So just over 57M rows. I get fewer rows from the NCBI.sqlite file, but the table is definitely there.

> library(RSQLite)
Warning message:
package 'RSQLite' was built under R version 4.3.2 
> con <- dbConnect(SQLite(), "NCBI.sqlite")
> dbGetQuery(con, "select count(*) from gene2pubmed;") 
  count(*)
1  4843195
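
The gzip/wc line count above assumes a Unix-like shell. On Windows, a rough equivalent in base R might look like the sketch below. `count_gz_lines` is a hypothetical helper, and the chunk size is an arbitrary choice to keep memory use bounded.

```r
# Hypothetical helper: count the lines in a gzip-compressed text file,
# reading it in chunks rather than loading it all at once.
count_gz_lines <- function(path, chunk = 1e6) {
  con <- gzfile(path, open = "r")
  on.exit(close(con), add = TRUE)
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break  # end of file reached
    n <- n + length(lines)
  }
  n
}

# Only attempt the count if the download is actually present.
if (file.exists("gene2pubmed.gz")) {
  print(count_gz_lines("gene2pubmed.gz"))
}
```

A count far below the figure reported above could point to an incomplete download, though the exact number will vary with the NCBI release.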

ADD REPLY • link 9 days ago James W. MacDonald 65k