Question about creating custom organism database with AnnotationForge
Adrian • 0
@fb5a0305
Last seen 13 months ago
South Africa

I'm trying to create a custom annotation database for Nothobranchius furzeri to use with my RNA-seq data, but the build fails with a timeout. Any help would be appreciated.

I downloaded the NCBI data manually and created the SQLite file by following James W. MacDonald's advice in the thread "Query regarding to create custom organism database with AnnotationForge package (AnnotationForge::makeOrgPackageFromNCBI)".

> writeFilesToDb <- function(file, file.dir = ".") {
  # Load the packages that provide the internal helpers and the SQLite driver
  library(AnnotationForge)
  library(RSQLite)
  tmp <- file.path(file.dir, file)
  # Look up this file's entry in AnnotationForge's internal list of primary NCBI files
  pfiles <- AnnotationForge:::.primaryFiles()
  file <- pfiles[file]
  NCBIcon <- dbConnect(SQLite(), file.path(file.dir, "NCBI.sqlite"))
  # Close the connection even if one of the steps below fails
  on.exit(dbDisconnect(NCBIcon), add = TRUE)
  # The table name is the file name without its .gz suffix
  tableName <- sub("\\.gz$", "", names(file))
  AnnotationForge:::.writeToNCBIDB(NCBIcon, tableName, filepath=tmp, file)
  AnnotationForge:::.setNCBIDateStamp(NCBIcon, tableName)
}
> fls <- dir(".", "^gene.+gz")
> fls
[1] "gene_info.gz"      "gene2accession.gz" "gene2go.gz"       
[4] "gene2pubmed.gz"    "gene2refseq.gz"   
> for(i in fls) writeFilesToDb(i)

This produced a ~43 GB NCBI.sqlite file. I then downloaded the UniProt ID mapping file (https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz) and put it in the same directory.
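Before starting the build, it can help to confirm that the cache file really contains one table per NCBI file. The following is only a minimal sketch: summarise_sqlite is a hypothetical helper (not part of AnnotationForge), it assumes the DBI and RSQLite packages are installed, and counting rows in a file this large may take a while.

```r
library(DBI)

# Hypothetical helper (not an AnnotationForge function):
# list every table in a SQLite file together with its row count.
summarise_sqlite <- function(path) {
  stopifnot(file.exists(path))
  con <- dbConnect(RSQLite::SQLite(), path)
  # Always release the connection, even on error
  on.exit(dbDisconnect(con), add = TRUE)
  tables <- dbListTables(con)
  counts <- vapply(tables, function(tbl) {
    sql <- paste0("SELECT COUNT(*) AS n FROM ", dbQuoteIdentifier(con, tbl))
    as.numeric(dbGetQuery(con, sql)$n)
  }, numeric(1))
  data.frame(table = tables, rows = unname(counts), stringsAsFactors = FALSE)
}
```

For the file built above this would be called as summarise_sqlite("NCBI.sqlite"). A missing table or a zero row count would point to a problem with the manual import rather than with the later Ensembl step.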

> makeOrgPackageFromNCBI(version = "0.1",
+                        author = "user@email.com",
+                        maintainer = "user <user@email.com>",
+                        outputDir = ".",
+                        NCBIFilesDir = ".",
+                        tax_id = "105023",
+                        genus = "Nothobranchius",
+                        species = "furzeri",
+                        rebuildCache = FALSE)
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Please be patient while we work out which organisms can be annotated
  with ensembl IDs.
TaxID: 9646
TaxID: 8839
TaxID: 28377
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'table' in selecting a method for function '%in%': Timeout was reached: [www.ensembl.org:443] Operation timed out after 10012 milliseconds with 382446 bytes received

Is there a way to get around the timeout, or perhaps save all the data locally so there's no need to connect to online databases?

Thanks in advance!

> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_GB.utf8 
[2] LC_CTYPE=English_GB.utf8   
[3] LC_MONETARY=English_GB.utf8
[4] LC_NUMERIC=C                         
[5] LC_TIME=English_GB.utf8    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] httr_1.4.7             AnnotationForge_1.40.2 AnnotationDbi_1.60.2  
[4] IRanges_2.32.0         S4Vectors_0.36.2       Biobase_2.58.0        
[7] BiocGenerics_0.44.0   

loaded via a namespace (and not attached):
 [1] rstudioapi_0.15.0      XVector_0.38.0         zlibbioc_1.44.0       
 [4] bit_4.0.5              R6_2.5.1               rlang_1.1.1           
 [7] fastmap_1.1.1          GenomeInfoDb_1.34.9    blob_1.2.4            
[10] tools_4.2.2            png_0.1-8              cli_3.6.1             
[13] DBI_1.1.3              bit64_4.0.5            crayon_1.5.2          
[16] GenomeInfoDbData_1.2.9 BiocManager_1.30.22    bitops_1.0-7          
[19] vctrs_0.6.4            RCurl_1.98-1.12        KEGGREST_1.38.0       
[22] memoise_2.0.1          cachem_1.0.8           RSQLite_2.3.1         
[25] compiler_4.2.2         Biostrings_2.66.0      XML_3.99-0.14
AnnotationForge • 1.0k views

I updated R and Bioconductor and raised the default timeout (raising it may not have been necessary, as the run was lightning fast compared to previous attempts). The build completed successfully. Thanks to shepherl and James W. MacDonald!

> getOption('timeout')
[1] 60
> options(timeout = 10000)

> makeOrgPackageFromNCBI(version = "0.1",
+                        author = "user@email.com",
+                        maintainer = "user <user@email.com>",
+                        outputDir = ".",
+                        NCBIFilesDir = ".",
+                        tax_id = "105023",
+                        genus = "Nothobranchius",
+                        species = "furzeri",
+                        rebuildCache = FALSE)
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
processing ensembl gene id data
making the OrgDb package ...
Populating genes table:
genes table filled
Populating pubmed table:
pubmed table filled
<snip>
Populating go_mf_all table:
go_mf_all table filled
Populating go_all table:
go_all table filled
Creating package in ./org.Nfurzeri.eg.db 
Now deleting temporary database file
complete!
[1] "org.Nfurzeri.eg.sqlite"
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] AnnotationForge_1.42.2 AnnotationDbi_1.62.2   IRanges_2.34.1         S4Vectors_0.38.2       Biobase_2.60.0         BiocGenerics_0.46.0    BiocManager_1.30.22   

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3          utf8_1.2.3              generics_0.1.3          bitops_1.0-7            xml2_1.3.5              RSQLite_2.3.1           stringi_1.7.12         
 [8] hms_1.1.3               digest_0.6.33           magrittr_2.0.3          GO.db_3.17.0            fastmap_1.1.1           blob_1.2.4              progress_1.2.2         
[15] GenomeInfoDb_1.36.4     DBI_1.1.3               httr_1.4.7              purrr_1.0.2             fansi_1.0.5             XML_3.99-0.14           Biostrings_2.68.1      
[22] cli_3.6.1               rlang_1.1.1             crayon_1.5.2            dbplyr_2.3.4            XVector_0.40.0          bit64_4.0.5             withr_2.5.1            
[29] cachem_1.0.8            tools_4.3.1             memoise_2.0.1           dplyr_1.1.3             GenomeInfoDbData_1.2.10 filelock_1.0.2          curl_5.1.0             
[36] vctrs_0.6.4             R6_2.5.1                png_0.1-8               lifecycle_1.0.3         BiocFileCache_2.8.0     zlibbioc_1.46.0         KEGGREST_1.40.1        
[43] stringr_1.5.0           bit_4.0.5               pkgconfig_2.0.3         pillar_1.9.0            glue_1.6.2              tibble_3.2.1            tidyselect_1.2.0
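For anyone following along, the directory reported in the output above (./org.Nfurzeri.eg.db) still has to be installed before it can be loaded. A minimal sketch, assuming the directory is an ordinary source package; install_local_orgdb is a hypothetical wrapper name, not an AnnotationForge function:

```r
# Hypothetical wrapper: install a locally generated annotation package
# from its source directory.
install_local_orgdb <- function(pkg_dir) {
  if (!dir.exists(pkg_dir)) stop("package directory not found: ", pkg_dir)
  # repos = NULL tells install.packages() to treat pkg_dir as a local source package
  install.packages(pkg_dir, repos = NULL, type = "source")
  invisible(basename(pkg_dir))
}

# Example (not run here), using the directory name from the output above:
# install_local_orgdb("./org.Nfurzeri.eg.db")
# library(org.Nfurzeri.eg.db)
# AnnotationDbi::keytypes(org.Nfurzeri.eg.db)  # which ID types can be used as keys
# AnnotationDbi::columns(org.Nfurzeri.eg.db)   # which annotation columns are available
```

Checking keytypes() and columns() first avoids guessing which identifiers made it into the package.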
shepherl 4.1k
@lshep
Last seen 1 day ago
United States

R has a default timeout limit. You should be able to raise it with something like options(timeout=100000), using whatever timeout value you need.
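If you would rather not change the setting for the whole session, the option can be raised for a single call and restored afterwards. This is only a sketch: with_timeout is a hypothetical helper name, and it is an assumption that the R "timeout" option governs the failing request at all (the error in the question reports roughly 10 seconds, while the session default shown in this thread is 60).

```r
# Hypothetical helper: evaluate `expr` with a temporary value for the
# R "timeout" option, then restore the previous value.
with_timeout <- function(seconds, expr) {
  old <- options(timeout = seconds)
  on.exit(options(old), add = TRUE)
  # `expr` is lazily evaluated, so it runs here, after the option is set
  force(expr)
}

# Example (not run here), wrapping the call from the question:
# with_timeout(10000, makeOrgPackageFromNCBI(version = "0.1", ...))
```

The "..." in the example stands for the remaining arguments exactly as given in the question.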


Also, update to the current version of R/Bioconductor. The functions to query Ensembl have been upgraded to work faster.
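A quick way to see whether an update is needed is to compare the installed version with the one that worked later in this thread (AnnotationForge 1.42.2; the failing session used 1.40.2). A rough sketch, where needs_update is a hypothetical helper name:

```r
# Hypothetical helper: TRUE if `pkg` is missing or older than `min_version`.
needs_update <- function(pkg, min_version) {
  if (!requireNamespace(pkg, quietly = TRUE)) return(TRUE)
  utils::packageVersion(pkg) < package_version(min_version)
}

# Only attempt the install in an interactive session.
if (interactive() && needs_update("AnnotationForge", "1.42.2")) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # Installs the AnnotationForge release that matches the running R version;
  # getting a newer Bioconductor release may require updating R first.
  BiocManager::install("AnnotationForge")
}
```

BiocManager::valid() can then be used to check that the installed packages are consistent with one Bioconductor release.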


Hi James and shepherl. Thanks for your answers!

I will update Bioconductor and try again later, and then post here. Thanks again for your help!
