I'm trying to create a custom organism database for Nothobranchius furzeri to use with my RNA-seq data, but I'm having trouble. Any help would be appreciated.
I downloaded the NCBI data manually and created the SQLite file by following James W. MacDonald's advice in the thread "Query regarding to create custom organism database with AnnotationForge package (AnnotationForge::makeOrgPackageFromNCBI)".
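> writeFilesToDb <- function(file, file.dir = ".") {
    # Load a manually downloaded NCBI file into NCBI.sqlite using
    # AnnotationForge internals (unexported, so they may change between releases).
    require("AnnotationForge", character.only = TRUE, quietly = TRUE)
    require("RSQLite", character.only = TRUE, quietly = TRUE)
    tmp <- file.path(file.dir, file)
    # Look up the expected column headers for this file
    pfiles <- AnnotationForge:::.primaryFiles()
    file <- pfiles[file]
    NCBIcon <- dbConnect(SQLite(), file.path(file.dir, "NCBI.sqlite"))
    # Close the connection even if one of the steps below fails
    on.exit(dbDisconnect(NCBIcon))
    # Table name is the file name without its .gz extension
    tableName <- sub("\\.gz$", "", names(file))
    AnnotationForge:::.writeToNCBIDB(NCBIcon, tableName, filepath = tmp, file)
    AnnotationForge:::.setNCBIDateStamp(NCBIcon, tableName)
}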
> fls <- dir(".", "^gene.+gz")
> fls
[1] "gene_info.gz" "gene2accession.gz" "gene2go.gz"
[4] "gene2pubmed.gz" "gene2refseq.gz"
> for(i in fls) writeFilesToDb(i)
This produced a ~43 GB NCBI.sqlite file. I then downloaded the UniProt ID mapping file (https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz) and put it in the same directory.
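Before starting the build, it can help to confirm that every file mentioned above really is in the directory passed as NCBIFilesDir. The following is a minimal sketch in base R; `checkNcbiFiles` is a hypothetical helper, not part of AnnotationForge, and the default file list is simply the set of files named in this post.

```r
# Hypothetical helper (not part of AnnotationForge): return the names of
# expected input files that are absent from the given directory.
checkNcbiFiles <- function(file.dir = ".",
                           expected = c("gene_info.gz", "gene2accession.gz",
                                        "gene2go.gz", "gene2pubmed.gz",
                                        "gene2refseq.gz",
                                        "idmapping_selected.tab.gz",
                                        "NCBI.sqlite")) {
  present <- file.exists(file.path(file.dir, expected))
  missing <- expected[!present]
  if (length(missing) > 0) {
    message("Missing from ", file.dir, ": ",
            paste(missing, collapse = ", "))
  }
  # Return invisibly so the result can be captured without cluttering the console
  invisible(missing)
}
```

Calling `checkNcbiFiles(".")` in the working directory returns a zero-length character vector when nothing is missing.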
> makeOrgPackageFromNCBI(version = "0.1",
+ author = "user@email.com",
+ maintainer = "user <user@email.com>",
+ outputDir = ".",
+ NCBIFilesDir = ".",
+ tax_id = "105023",
+ genus = "Nothobranchius",
+ species = "furzeri",
+ rebuildCache = FALSE)
preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Please be patient while we work out which organisms can be annotated
with ensembl IDs.
TaxID: 9646
TaxID: 8839
TaxID: 28377
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'table' in selecting a method for function '%in%': Timeout was reached: [www.ensembl.org:443] Operation timed out after 10012 milliseconds with 382446 bytes received
Is there a way to get around the timeout, or perhaps save all the data locally so there's no need to connect to online databases?
Thanks in advance!
> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_GB.utf8
[2] LC_CTYPE=English_GB.utf8
[3] LC_MONETARY=English_GB.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_GB.utf8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] httr_1.4.7 AnnotationForge_1.40.2 AnnotationDbi_1.60.2
[4] IRanges_2.32.0 S4Vectors_0.36.2 Biobase_2.58.0
[7] BiocGenerics_0.44.0
loaded via a namespace (and not attached):
[1] rstudioapi_0.15.0 XVector_0.38.0 zlibbioc_1.44.0
[4] bit_4.0.5 R6_2.5.1 rlang_1.1.1
[7] fastmap_1.1.1 GenomeInfoDb_1.34.9 blob_1.2.4
[10] tools_4.2.2 png_0.1-8 cli_3.6.1
[13] DBI_1.1.3 bit64_4.0.5 crayon_1.5.2
[16] GenomeInfoDbData_1.2.9 BiocManager_1.30.22 bitops_1.0-7
[19] vctrs_0.6.4 RCurl_1.98-1.12 KEGGREST_1.38.0
[22] memoise_2.0.1 cachem_1.0.8 RSQLite_2.3.1
[25] compiler_4.2.2 Biostrings_2.66.0 XML_3.99-0.14
Update: I updated Bioconductor and R and increased the default timeout (this may not have been necessary, as the run was much faster than in previous attempts). The build then completed successfully. Thanks to shepherl and James W. MacDonald!
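For anyone who hits the same error, here is a rough sketch of raising the timeout before calling makeOrgPackageFromNCBI. It is an assumption on two counts: which setting actually governs the Ensembl request may differ between AnnotationForge versions, so the sketch sets both base R's `timeout` option and httr's global config, and the value of 600 seconds is arbitrary.

```r
# Assumption: it is not established here which timeout the Ensembl query honours,
# so both are raised. 600 seconds is an arbitrary illustrative value.

# Base R timeout for downloads, in seconds (the default is 60).
# options() returns the previous values, so they can be restored later.
old <- options(timeout = 600)

# httr keeps its own global request configuration; only set it if httr is installed.
if (requireNamespace("httr", quietly = TRUE)) {
  httr::set_config(httr::timeout(600))
}

# Confirm the base R setting took effect
getOption("timeout")
```

Running `options(old)` afterwards restores the previous base R setting.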