Error for AnnotationForge makeOrgPackageFromNCBI function
Gayatri • 0
@1e961e20
Last seen 7 days ago
India

Hello everyone,

I am a PhD student performing GSEA on Candida albicans data. I queried AnnotationHub for an existing OrgDb and found none, so I am trying to build an organism database for C. albicans myself. I have gone through the threads "Problem making orgdb package for bacteria (Pseudomonas) using annotation hub and annotation forge", "error with downloading NCBI data for makeorgpackagefromNCBI?", and "AnnotationForge not working for building custom org packages", but I am still encountering the error shown below. I have tried deleting the NCBI.sqlite file and downloading the files directly from https://ftp.ncbi.nlm.nih.gov/gene/DATA/, but to no avail. I also tried raising the timeout setting to 10000. Any help in this regard is highly appreciated.


> hub <- AnnotationHub()
  |=====================================================================================| 100%

snapshotDate(): 2023-10-23
> query(hub, c("OrgDb","Candida albicans"))
AnnotationHub with 0 records
# snapshotDate(): 2023-10-23

> getOption('timeout')
[1] 60
> options(timeout = 10000)
> getOption('timeout')
[1] 10000

> makeOrgPackageFromNCBI("0.1", 
+                        "Gayatri <gayatri@catg.edu.in>", 
+                        "Gayatri", 
+                        ".", 
+                        "237561", 
+                        "Candida", 
+                        "albicans", 
+                        rebuildCache = FALSE)
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
Error: no such table: main.gene2pubmed

> list.files()
[1] "gene_info.gz"      "gene2accession.gz" "gene2go.gz"        "gene2pubmed.gz"   
[5] "gene2refseq.gz" 

> sessionInfo()
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Asia/Calcutta
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] AnnotationForge_1.44.0 biomaRt_2.58.2         AnnotationHub_3.10.1  
 [4] BiocFileCache_2.10.2   dbplyr_2.5.0           GenomeInfoDb_1.38.8   
 [7] AnnotationDbi_1.64.1   IRanges_2.36.0         S4Vectors_0.40.2      
[10] Biobase_2.62.0         BiocGenerics_0.48.1   

loaded via a namespace (and not attached):
 [1] KEGGREST_1.42.0               vctrs_0.6.5                   tools_4.3.3                  
 [4] bitops_1.0-7                  generics_0.1.3                curl_5.2.1                   
 [7] tibble_3.2.1                  fansi_1.0.6                   RSQLite_2.3.6                
[10] blob_1.2.4                    pkgconfig_2.0.3               lifecycle_1.0.4              
[13] GenomeInfoDbData_1.2.11       compiler_4.3.3                stringr_1.5.1                
[16] Biostrings_2.70.3             progress_1.2.3                httpuv_1.6.15                
[19] htmltools_0.5.8.1             RCurl_1.98-1.14               yaml_2.3.8                   
[22] interactiveDisplayBase_1.40.0 pillar_1.9.0                  later_1.3.2                  
[25] crayon_1.5.2                  cachem_1.0.8                  mime_0.12                    
[28] tidyselect_1.2.1              digest_0.6.35                 stringi_1.8.3                
[31] dplyr_1.1.4                   BiocVersion_3.18.1            fastmap_1.1.1                
[34] cli_3.6.2                     magrittr_2.0.3                XML_3.99-0.16.1              
[37] utf8_1.2.4                    prettyunits_1.2.0             filelock_1.0.3               
[40] promises_1.3.0                rappdirs_0.3.3                bit64_4.0.5                  
[43] XVector_0.42.0                httr_1.4.7                    bit_4.0.5                    
[46] png_0.1-8                     hms_1.1.3                     memoise_2.0.1                
[49] shiny_1.8.1.1                 rlang_1.1.3                   Rcpp_1.0.12                  
[52] xtable_1.8-4                  glue_1.7.0                    DBI_1.2.2                    
[55] xml2_1.3.6                    BiocManager_1.30.22           R6_2.5.1                     
[58] zlibbioc_1.48.2              
Warning message:
call dbDisconnect() when finished working with a connection
OrgDb AnnotationForge • 340 views
@james-w-macdonald-5106
Last seen 5 hours ago
United States

That error indicates that your NCBI.sqlite database is missing the gene2pubmed table, so you should regenerate that db. Just delete it and then rerun the script as you already have.


Hello James,

How do I generate the db? By running the makeOrgPackageFromNCBI() command? I have tried that, first by deleting just the NCBI.sqlite file and running the command, and then by deleting both NCBI.sqlite and the gene2pubmed file, since that file is around 180 MB. But I still get an error that the gene2accession file is partially transferred, because it too is re-downloaded even though it was already downloaded before I ran the command.


I don't really follow what you are saying. All you have to do is delete the NCBI.sqlite db and re-run the script exactly as you did above. If you say rebuildCache = FALSE you shouldn't download anything. And the error you got before didn't say anything about downloading files. It said that you were missing the gene2pubmed table.


What I meant is that even after deleting the NCBI.sqlite file and re-running the script, an empty NCBI.sqlite file (0 KB) is created, which causes the error I mentioned:

preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
Error: no such table: main.gene2pubmed

I made sure any partially created NCBI.sqlite files were deleted, then downloaded the data directly from NCBI, and only then re-ran the script. But the creation of this empty NCBI.sqlite file causes the script to terminate. What do you suggest I do about this?
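
For anyone reproducing this, one way to confirm the state described above (a 0 KB NCBI.sqlite next to the manually downloaded files) is to list the files and their sizes before re-running. This is only a sketch in base R: `describe_ncbi_files` is a hypothetical helper name, not part of AnnotationForge, and the file names are assumed to be the ones shown in the console output above.

```r
# Hypothetical helper (not part of AnnotationForge): report whether the cache
# database and the five NCBI downloads exist, and how large each one is.
describe_ncbi_files <- function(dir = ".") {
  # File names taken from the console output in this thread.
  files <- c("NCBI.sqlite", "gene2pubmed.gz", "gene2accession.gz",
             "gene2refseq.gz", "gene_info.gz", "gene2go.gz")
  paths <- file.path(dir, files)
  size <- file.size(paths)  # NA for files that do not exist
  data.frame(file = files,
             exists = file.exists(paths),
             size_bytes = size,
             empty = !is.na(size) & size == 0,  # TRUE for a 0-byte file
             stringsAsFactors = FALSE)
}

# Inspect the current working directory.
print(describe_ncbi_files("."))
```

A row with `empty = TRUE` for NCBI.sqlite matches the situation described here. A .gz file much smaller than the copy on the NCBI FTP site would suggest a partial transfer, although file size alone does not prove a download is complete.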


That's weird. I don't have any problem at all generating the OrgDb on my box. How big is the gene2pubmed.gz file? I get this:

gzip -dc gene2pubmed.gz | wc -l
57054066

So just over 57M rows. I get fewer from the NCBI.sqlite file, but it's definitely there.

> library(RSQLite)
Warning message:
package 'RSQLite' was built under R version 4.3.2 
> con <- dbConnect(SQLite(), "NCBI.sqlite")
> dbGetQuery(con, "select count(*) from gene2pubmed;") 
  count(*)
1  4843195
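
Since the original poster is on Windows, where `gzip` and `wc` may not be available, the line count above can be approximated from within R. The following is a minimal base-R sketch; `count_gz_lines` and its `chunk` argument are hypothetical names introduced for illustration, and reading a file of this size may take a while.

```r
# Hypothetical helper: count the lines of a gzip-compressed text file,
# reading it in chunks so the whole file is never held in memory.
count_gz_lines <- function(path, chunk = 1e6) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  total <- 0
  repeat {
    # readLines() continues from the current position of an open connection.
    n <- length(readLines(con, n = chunk, warn = FALSE))
    if (n == 0) break
    total <- total + n
  }
  total
}

# Only run the count if the download is present in the working directory.
if (file.exists("gene2pubmed.gz")) {
  print(count_gz_lines("gene2pubmed.gz"))
}
```

A count far below the figure reported above would be consistent with an incomplete download, though the exact number changes as NCBI updates the file.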
