Question about creating custom organism database with AnnotationForge

Adrian • 0

I'm trying to create a custom organism database for Nothobranchius furzeri for use with my RNA-seq data, but I'm running into trouble. Any help would be appreciated.

I downloaded the NCBI data manually and created the SQLite file by following James W. MacDonald's advice in the thread "Query regarding to create custom organism database with AnnotationForge package (AnnotationForge::makeOrgPackageFromNCBI)":

> writeFilesToDb <- function(file, file.dir = ".") {
  require("AnnotationForge", character.only = TRUE,  quietly = TRUE)
  require("RSQLite", character.only = TRUE, quietly = TRUE)
  tmp <- file.path(file.dir, file)
  pfiles <- AnnotationForge:::.primaryFiles()
  file <- pfiles[file]
  NCBIcon <- dbConnect(SQLite(), file.path(file.dir, "NCBI.sqlite"))
  tableName <- sub("\\.gz$", "", names(file))  # strip only a literal trailing ".gz"
  AnnotationForge:::.writeToNCBIDB(NCBIcon, tableName, filepath=tmp, file)
  AnnotationForge:::.setNCBIDateStamp(NCBIcon, tableName)
  dbDisconnect(NCBIcon)
}
> fls <- dir(".", "^gene.+gz")
> fls
[1] "gene_info.gz"      "gene2accession.gz" "gene2go.gz"       
[4] "gene2pubmed.gz"    "gene2refseq.gz"   
> for(i in fls) writeFilesToDb(i)

This produced a ~43 GB NCBI.sqlite file. I then downloaded the UniProt idmapping_selected.tab.gz file and put it in the same directory (https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz).

> makeOrgPackageFromNCBI(version = "0.1",
+                        author = "user@email.com",
+                        maintainer = "user <user@email.com>",
+                        outputDir = ".",
+                        NCBIFilesDir = ".",
+                        tax_id = "105023",
+                        genus = "Nothobranchius",
+                        species = "furzeri",
+                        rebuildCache = FALSE)
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Please be patient while we work out which organisms can be annotated
  with ensembl IDs.
TaxID: 9646
TaxID: 8839
TaxID: 28377
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'table' in selecting a method for function '%in%': Timeout was reached: [www.ensembl.org:443] Operation timed out after 10012 milliseconds with 382446 bytes received

Is there a way to get around the timeout, or perhaps save all the data locally so there's no need to connect to online databases?

Thanks in advance!

> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_GB.utf8 
[2] LC_CTYPE=English_GB.utf8   
[3] LC_MONETARY=English_GB.utf8
[4] LC_NUMERIC=C                         
[5] LC_TIME=English_GB.utf8    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] httr_1.4.7             AnnotationForge_1.40.2 AnnotationDbi_1.60.2  
[4] IRanges_2.32.0         S4Vectors_0.36.2       Biobase_2.58.0        
[7] BiocGenerics_0.44.0   

loaded via a namespace (and not attached):
 [1] rstudioapi_0.15.0      XVector_0.38.0         zlibbioc_1.44.0       
 [4] bit_4.0.5              R6_2.5.1               rlang_1.1.1           
 [7] fastmap_1.1.1          GenomeInfoDb_1.34.9    blob_1.2.4            
[10] tools_4.2.2            png_0.1-8              cli_3.6.1             
[13] DBI_1.1.3              bit64_4.0.5            crayon_1.5.2          
[16] GenomeInfoDbData_1.2.9 BiocManager_1.30.22    bitops_1.0-7          
[19] vctrs_0.6.4            RCurl_1.98-1.12        KEGGREST_1.38.0       
[22] memoise_2.0.1          cachem_1.0.8           RSQLite_2.3.1         
[25] compiler_4.2.2         Biostrings_2.66.0      XML_3.99-0.14

AnnotationForge • 1.0k views

ADD COMMENT • link 14 months ago Adrian • 0

I updated R and Bioconductor and raised the default timeout (which may not have been necessary, as the run was lightning fast compared to previous attempts). The build completed successfully. Thanks to shepherl and James W. MacDonald!

> getOption('timeout')
[1] 60
> options(timeout = 10000)

> makeOrgPackageFromNCBI(version = "0.1",
+                        author = "user@email.com",
+                        maintainer = "user <user@email.com>",
+                        outputDir = ".",
+                        NCBIFilesDir = ".",
+                        tax_id = "105023",
+                        genus = "Nothobranchius",
+                        species = "furzeri",
+                        rebuildCache = FALSE)
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
processing ensembl gene id data
making the OrgDb package ...
Populating genes table:
genes table filled
Populating pubmed table:
pubmed table filled
<snip>
Populating go_mf_all table:
go_mf_all table filled
Populating go_all table:
go_all table filled
Creating package in ./org.Nfurzeri.eg.db 
Now deleting temporary database file
complete!
[1] "org.Nfurzeri.eg.sqlite"

> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] AnnotationForge_1.42.2 AnnotationDbi_1.62.2   IRanges_2.34.1         S4Vectors_0.38.2       Biobase_2.60.0         BiocGenerics_0.46.0    BiocManager_1.30.22   

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3          utf8_1.2.3              generics_0.1.3          bitops_1.0-7            xml2_1.3.5              RSQLite_2.3.1           stringi_1.7.12         
 [8] hms_1.1.3               digest_0.6.33           magrittr_2.0.3          GO.db_3.17.0            fastmap_1.1.1           blob_1.2.4              progress_1.2.2         
[15] GenomeInfoDb_1.36.4     DBI_1.1.3               httr_1.4.7              purrr_1.0.2             fansi_1.0.5             XML_3.99-0.14           Biostrings_2.68.1      
[22] cli_3.6.1               rlang_1.1.1             crayon_1.5.2            dbplyr_2.3.4            XVector_0.40.0          bit64_4.0.5             withr_2.5.1            
[29] cachem_1.0.8            tools_4.3.1             memoise_2.0.1           dplyr_1.1.3             GenomeInfoDbData_1.2.10 filelock_1.0.2          curl_5.1.0             
[36] vctrs_0.6.4             R6_2.5.1                png_0.1-8               lifecycle_1.0.3         BiocFileCache_2.8.0     zlibbioc_1.46.0         KEGGREST_1.40.1        
[43] stringr_1.5.0           bit_4.0.5               pkgconfig_2.0.3         pillar_1.9.0            glue_1.6.2              tibble_3.2.1            tidyselect_1.2.0

ADD REPLY • link 14 months ago Adrian • 0

score 1 · Answer 1 · 2023-10-20

shepherl 4.1k

R has a default timeout limit for internet operations (60 seconds, as getOption("timeout") reports). You should be able to raise it with something like options(timeout = 100000), using whatever timeout value you need.
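As a minimal sketch of that suggestion, the timeout can be raised only for the duration of one call and then put back. The helper name with_timeout is hypothetical (it is not part of AnnotationForge or base R), and it is an assumption that the Ensembl query inside makeOrgPackageFromNCBI honours R's timeout option: the error in the question reports a cutoff at about 10 seconds rather than 60, so a different limit may be involved.

```r
# Hypothetical helper (not from AnnotationForge): evaluate an expression with
# a temporarily raised 'timeout' option, then restore the previous value.
with_timeout <- function(seconds, expr) {
  old <- options(timeout = seconds)   # options() returns the previous settings
  on.exit(options(old), add = TRUE)   # restore them even if expr fails
  force(expr)                         # expr is evaluated lazily, i.e. here
}

# Quick check of the value that code inside the call sees:
with_timeout(10000, getOption("timeout"))

# Assumed usage with the call from the question (needs the NCBI files and
# network access, so it is left commented out):
# with_timeout(10000, makeOrgPackageFromNCBI(version = "0.1", ...))
```

Calling options(timeout = 10000) directly, as in the comment above, has the same effect for the rest of the session; the wrapper only saves resetting it by hand.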

ADD COMMENT • link 14 months ago shepherl 4.1k

Also, update to the current version of R/Bioconductor. The functions to query Ensembl have been upgraded to work faster.
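A rough sketch of how one might check what is installed before and after updating. The helper installed_versions is hypothetical, and the commented BiocManager calls are the usual upgrade route rather than anything specific to this thread; "X.Y" is a placeholder for the release you want.

```r
# Hypothetical helper: report the installed version of each package,
# or NA if the package is not installed.
installed_versions <- function(pkgs) {
  vapply(pkgs, function(p) {
    if (nzchar(system.file(package = p))) {
      as.character(utils::packageVersion(p))
    } else {
      NA_character_
    }
  }, character(1))
}

R.version.string
installed_versions(c("BiocManager", "AnnotationForge", "AnnotationDbi"))

# Upgrading needs network access, so these are left commented out:
# BiocManager::install(version = "X.Y")    # switch to Bioconductor release X.Y
# BiocManager::install("AnnotationForge")  # (re)install the package itself
```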

ADD REPLY • link 14 months ago James W. MacDonald 67k

Hi James and shepherl. Thanks for your answers!

I will update Bioconductor and try again later, and then post here. Thanks again for your help!

ADD REPLY • link 14 months ago Adrian • 0