Hello everyone,
I am a PhD student working on GSEA for Candida albicans data. I queried AnnotationHub for existing records and found none, so I am trying to make an organism database (OrgDb) package for C. albicans. I have gone through the threads "Problem making orgdb package for bacteria (Pseudomonas) using annotation hub and annotation forge", "error with downloading NCBI data for makeorgpackagefromNCBI?", and "AnnotationForge not working for building custom org packages", but I am still encountering the errors below. After deleting the NCBI.sqlite file, I tried downloading the files directly from https://ftp.ncbi.nlm.nih.gov/gene/DATA/, but to no avail. I also tried raising the timeout setting to 10000. Any help would be highly appreciated.
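For reference, the manual download described above could be sketched roughly as follows. This is only an illustration, not what AnnotationForge does internally: `ncbi_url()` and `fetch_ncbi_files()` are hypothetical helper names, the file list is copied from the download messages in the transcript, and the timeout value is the one mentioned in this post.

```r
# Hypothetical helpers (not part of AnnotationForge) for fetching the five
# NCBI files by hand. The file names come from the transcript in this post.
ncbi_files <- c("gene2pubmed.gz", "gene2accession.gz", "gene2refseq.gz",
                "gene_info.gz", "gene2go.gz")

# Build the download URL for one file
ncbi_url <- function(file,
                     base_url = "https://ftp.ncbi.nlm.nih.gov/gene/DATA/") {
  paste0(base_url, file)
}

# Download any file that is not already present in dest_dir.
# Note: a partially transferred file still counts as "present", so delete
# suspicious files first.
fetch_ncbi_files <- function(files = ncbi_files, dest_dir = ".",
                             timeout = 10000) {
  old <- options(timeout = timeout)   # raise the timeout for large files
  on.exit(options(old), add = TRUE)   # restore the previous setting
  for (f in files) {
    dest <- file.path(dest_dir, f)
    if (file.exists(dest)) next
    # mode = "wb" keeps the gzip files intact on Windows
    download.file(ncbi_url(f), destfile = dest, mode = "wb")
  }
  invisible(file.path(dest_dir, files))
}

# Usage (performs the actual downloads, so it is left commented out):
# fetch_ncbi_files()
```

Binary mode matters on Windows, where a text-mode download can corrupt compressed files.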
> hub <- AnnotationHub()
|=====================================================================================| 100%
snapshotDate(): 2023-10-23
> query(hub, c("OrgDb","Candida albicans"))
AnnotationHub with 0 records
# snapshotDate(): 2023-10-23
> getOption('timeout')
[1] 60
> options(timeout = 10000)
> getOption('timeout')
[1] 10000
> makeOrgPackageFromNCBI("0.1",
+ "Gayatri <gayatri@catg.edu.in>",
+ "Gayatri",
+ ".",
+ "237561",
+ "Candida",
+ "albicans",
+ rebuildCache = FALSE)
preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
Error: no such table: main.gene2pubmed
> list.files()
[1] "gene_info.gz" "gene2accession.gz" "gene2go.gz" "gene2pubmed.gz"
[5] "gene2refseq.gz"
> sessionInfo()
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Asia/Calcutta
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] AnnotationForge_1.44.0 biomaRt_2.58.2 AnnotationHub_3.10.1
[4] BiocFileCache_2.10.2 dbplyr_2.5.0 GenomeInfoDb_1.38.8
[7] AnnotationDbi_1.64.1 IRanges_2.36.0 S4Vectors_0.40.2
[10] Biobase_2.62.0 BiocGenerics_0.48.1
loaded via a namespace (and not attached):
[1] KEGGREST_1.42.0 vctrs_0.6.5 tools_4.3.3
[4] bitops_1.0-7 generics_0.1.3 curl_5.2.1
[7] tibble_3.2.1 fansi_1.0.6 RSQLite_2.3.6
[10] blob_1.2.4 pkgconfig_2.0.3 lifecycle_1.0.4
[13] GenomeInfoDbData_1.2.11 compiler_4.3.3 stringr_1.5.1
[16] Biostrings_2.70.3 progress_1.2.3 httpuv_1.6.15
[19] htmltools_0.5.8.1 RCurl_1.98-1.14 yaml_2.3.8
[22] interactiveDisplayBase_1.40.0 pillar_1.9.0 later_1.3.2
[25] crayon_1.5.2 cachem_1.0.8 mime_0.12
[28] tidyselect_1.2.1 digest_0.6.35 stringi_1.8.3
[31] dplyr_1.1.4 BiocVersion_3.18.1 fastmap_1.1.1
[34] cli_3.6.2 magrittr_2.0.3 XML_3.99-0.16.1
[37] utf8_1.2.4 prettyunits_1.2.0 filelock_1.0.3
[40] promises_1.3.0 rappdirs_0.3.3 bit64_4.0.5
[43] XVector_0.42.0 httr_1.4.7 bit_4.0.5
[46] png_0.1-8 hms_1.1.3 memoise_2.0.1
[49] shiny_1.8.1.1 rlang_1.1.3 Rcpp_1.0.12
[52] xtable_1.8-4 glue_1.7.0 DBI_1.2.2
[55] xml2_1.3.6 BiocManager_1.30.22 R6_2.5.1
[58] zlibbioc_1.48.2
Warning message:
call dbDisconnect() when finished working with a connection
Hello James,
How do I generate the db? By running the makeOrgPackageFromNCBI() command? I have tried that, first deleting only the NCBI.sqlite file and running the command, and then deleting both the NCBI.sqlite file and the gene2pubmed file (which is around 180 MB). But I still get an error that the gene2accession file was only partially transferred, because it is re-downloaded too, even though it had already been downloaded before I ran the command.
I don't really follow what you are saying. All you have to do is delete the NCBI.sqlite db and re-run the script exactly as you did above. If you say rebuildCache = FALSE you shouldn't download anything. And the error you got before didn't say anything about downloading files. It said that you were missing the gene2pubmed table.
What I meant is that even after I delete the NCBI.sqlite file and re-run the script, an empty NCBI.sqlite file (0 KB) is created, which causes the error I mentioned:
preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
Error: no such table: main.gene2pubmed
I made sure that any partially created NCBI.sqlite files were deleted, downloaded the data directly from NCBI, and only then re-ran the script. But the creation of this empty NCBI.sqlite file causes the script to terminate. What do you suggest I do about this?
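One way to confirm the state of the cache before re-running is to check the size of NCBI.sqlite and list whatever tables it holds. The sketch below is only a diagnostic under a few assumptions: `inspect_cache()` is a hypothetical helper, not an AnnotationForge function; the table listing is attempted only if the RSQLite package is installed; and the table name gene2pubmed is taken from the error message above.

```r
# Hypothetical helper: report whether the cache file exists, how big it is,
# and which tables it contains.
inspect_cache <- function(db_path = "NCBI.sqlite") {
  if (!file.exists(db_path)) {
    return(list(exists = FALSE, size = NA_real_, tables = character(0)))
  }
  size <- file.size(db_path)
  tables <- character(0)
  # Only open non-empty files, and only if RSQLite is available
  if (size > 0 && requireNamespace("RSQLite", quietly = TRUE)) {
    con <- DBI::dbConnect(RSQLite::SQLite(), db_path)
    on.exit(DBI::dbDisconnect(con), add = TRUE)
    tables <- DBI::dbListTables(con)
  }
  list(exists = TRUE, size = size, tables = tables)
}

info <- inspect_cache("NCBI.sqlite")
print(info)

# A zero-byte file has no gene2pubmed table, which would match the
# "no such table: main.gene2pubmed" error
if (info$exists && !("gene2pubmed" %in% info$tables)) {
  message("NCBI.sqlite exists but has no gene2pubmed table")
}
```

If the message appears right after a failed run, the empty file was left behind by that run rather than by an earlier one.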
That's weird. I don't have any problem at all generating the OrgDb on my box. How big is the gene2pubmed.gz file? I get this:
So just over 57M rows. I get fewer from the NCBI.sqlite file, but it's definitely there.
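To compare a local copy against those numbers, something along the following lines would report the size of the compressed file and count its lines. It is a base-R sketch only: `count_rows()` is a hypothetical helper, the count includes the header line, and a truncated download may produce a warning or a short count instead of an error.

```r
# Hypothetical helper: count the lines of a gzip-compressed text file in
# chunks, so the whole file never has to be held in memory.
count_rows <- function(path, chunk_size = 1e6) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con), add = TRUE)
  n <- 0
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0) break
    n <- n + length(chunk)
  }
  n
}

gz <- "gene2pubmed.gz"
if (file.exists(gz)) {
  cat("size (MB):", round(file.size(gz) / 1024^2, 1), "\n")
  cat("lines (including header):", count_rows(gz), "\n")
}
```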