Question

makeOrgPackageFromNCBI - cannot connect to BioMart

lukeholman • 0

Hey there,

I'm having trouble getting makeOrgPackageFromNCBI() to work. As the output below shows, it downloads the data from NCBI just fine, then fails when it tries to access BioMart. When I open the BioMart URL from the error message, I get a blank page containing only the text 0.7, the same as when I tried a week ago. Any ideas? Based on other threads about similar errors, either BioMart is down much of the time or my university's firewall is blocking the request (although I've never had trouble with anything else). My sessionInfo() output is below.

Also, can someone please tell me what will be in the OrgDb once I finally make it? It's unclear why the message says the database will be 12 GB; surely that depends on how much information there is for the organism in question? I checked the help files (e.g. ?`OrgDb-class`), and they don't seem to say what's in the database.

Cheers!

Luke

AnnotationForge::makeOrgPackageFromNCBI(
  version = "0.1",
  maintainer = "Luke",
  author = "Luke",
  outputDir = getwd(),
  tax_id = "7460",
  genus = "Apis",
  species = "mellifera",
  NCBIFilesDir = getwd(),
  databaseOnly = FALSE,
  useDeprecatedStyle = FALSE,
  rebuildCache = TRUE,
  verbose = TRUE)

If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene2unigene
[5] gene_info.gz
[6] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene2unigene
rebuilding the cache
extracting data for our organism from : gene2unigene
getting all data for our organism from : gene2unigene
getting data for gene_info.gz
rebuilding the cache
extracting data for our organism from : gene_info
getting data for gene2go.gz
rebuilding the cache
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Please be patient while we work out which organisms can be annotated with ensembl IDs.
Request to BioMart web service failed.
The BioMart web service you're accessing may be down.
Check the following URL and see if this website is available:
http://www.ensembl.org:80/biomart/martservice?type=version&requestid=biomaRt&mart=ENSEMBL_MART_ENSEMBL
Error in if (BioMartVersion == "\n" | BioMartVersion == "") { : argument is of length zero
In addition: Warning messages:
1: In result_fetch(res@ptr, n = n) : Don't need to call dbFetch() for statements, only for queries
2: In result_fetch(res@ptr, n = n) : Don't need to call dbFetch() for statements, only for queries
3: In result_fetch(res@ptr, n = n) : Don't need to call dbFetch() for statements, only for queries
4: In result_fetch(res@ptr, n = n) : Don't need to call dbFetch() for statements, only for queries
5: In result_fetch(res@ptr, n = n) : Don't need to call dbFetch() for statements, only for queries
6: In result_fetch(res@ptr, n = n) : Don't need to call dbFetch() for statements, only for queries
7: In result_fetch(res@ptr, n = n) : Don't need to call dbFetch() for statements, only for queries
8: In result_fetch(res@ptr, n = n) : Don't need to call dbFetch() for statements, only for queries

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] grid stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] scales_0.5.0 bindrcpp_0.2.2 clusterProfiler_3.8.1 kableExtra_0.9.0 pander_0.6.2 sva_3.28.0 BiocParallel_1.14.2 genefilter_1.62.0
[9] mgcv_1.8-24 nlme_3.1-137 MuMIn_1.42.1 ecodist_2.0.1 gplots_3.0.1 ggjoy_0.4.1 ggridges_0.5.0 RColorBrewer_1.1-2
[17] gridExtra_2.3 ggdendro_0.1-20 ggrepel_0.8.0 ggplot2_3.0.0 stringr_1.3.1 tidyr_0.8.1 dplyr_0.7.6 reshape2_1.4.3
[25] RSQLite_2.1.1 WGCNA_1.63 fastcluster_1.1.25 dynamicTreeCut_1.63-1 GOstats_2.46.0 Category_2.46.0 Matrix_1.2-14 biomaRt_2.36.1
[33] GSEABase_1.42.0 graph_1.58.0 annotate_1.58.0 XML_3.98-1.12 AnnotationDbi_1.42.1 IRanges_2.14.10 S4Vectors_0.18.3 Biobase_2.40.0
[41] BiocGenerics_0.26.0

loaded via a namespace (and not attached):
[1] backports_1.1.2 Hmisc_4.1-1 fastmatch_1.1-0 plyr_1.8.4 igraph_1.2.1 lazyeval_0.2.1 splines_3.5.1 GenomeInfoDb_1.16.0
[9] robust_0.4-18 digest_0.6.15 foreach_1.4.4 htmltools_0.3.6 GOSemSim_2.6.0 viridis_0.5.1 GO.db_3.6.0 fansi_0.2.3
[17] gdata_2.18.0 magrittr_1.5 checkmate_1.8.5 memoise_1.1.0 fit.models_0.5-14 cluster_2.0.7-1 doParallel_1.0.11 limma_3.36.2
[25] readr_1.1.1 matrixStats_0.54.0 enrichplot_1.0.2 prettyunits_1.0.2 colorspace_1.3-2 rvest_0.3.2 blob_1.1.1 rrcov_1.4-4
[33] crayon_1.3.4 RCurl_1.95-4.11 bindr_0.1.1 impute_1.54.0 survival_2.42-6 iterators_1.0.10 glue_1.3.0 gtable_0.2.0
[41] UpSetR_1.3.3 Rgraphviz_2.24.0 DEoptimR_1.0-8 DOSE_3.6.1 mvtnorm_1.0-8 DBI_1.0.0 Rcpp_0.12.18 viridisLite_0.3.0
[49] xtable_1.8-2 progress_1.2.0 htmlTable_1.12 units_0.6-0 foreign_0.8-71 bit_1.1-14 preprocessCore_1.42.0 Formula_1.2-3
[57] AnnotationForge_1.22.1 htmlwidgets_1.2 httr_1.3.1 fgsea_1.6.0 acepack_1.4.1 pkgconfig_2.0.1 nnet_7.3-12 dbplyr_1.2.2
[65] utf8_1.1.4 labeling_0.3 tidyselect_0.2.4 rlang_0.2.1 munsell_0.5.0 tools_3.5.1 cli_1.0.0 evaluate_0.11
[73] yaml_2.2.0 knitr_1.20 bit64_0.9-7 robustbase_0.93-1.1 caTools_1.17.1.1 purrr_0.2.5 ggraph_1.0.2 RBGL_1.56.0
[81] xml2_1.2.0 DO.db_2.9 compiler_3.5.1 rstudioapi_0.7 tibble_1.4.2 tweenr_0.1.5 pcaPP_1.9-73 stringi_1.2.4
[89] highr_0.7 lattice_0.20-35 pillar_1.3.0 data.table_1.11.4 cowplot_0.9.3 bitops_1.0-6 qvalue_2.12.0 R6_2.2.2
[97] latticeExtra_0.6-28 KernSmooth_2.23-15 codetools_0.2-15 MASS_7.3-50 gtools_3.8.1 assertthat_0.2.0 rprojroot_1.3-2 withr_2.1.2
[105] GenomeInfoDbData_1.1.0 hms_0.4.2 rpart_4.1-13 rmarkdown_1.10 rvcheck_0.1.0 ggforce_0.1.3 base64enc_0.1-3
>

annotationforge • 1.7k views

ADD COMMENT • link updated 7.5 years ago by James W. MacDonald 68k • written 7.5 years ago by lukeholman • 0

score 1 · Answer 1 · 2018-07-30

The message about a 12 GB cache database doesn't pertain to your OrgDb package - it's an informational message about all the data that will be downloaded from NCBI in order to generate the OrgDb package. The data for your organism are then extracted from that cache.

The OrgDb package will contain mappings between the Entrez Gene IDs for A. mellifera and IDs from various other databases (gene symbol, gene name, GO terms, etc). See the AnnotationDbi vignette for more information about OrgDb packages.

As for the BioMart query, the problem is most likely that the query is going to ensembl.org when it should go to metazoa.ensembl.org. In principle the function should pick the correct Ensembl host based on the taxonomy ID, but that might be pretty tricky to implement, and it's not clear there is any benefit in trying to get it to work.

I say that because there is already a pre-built annotation package for this species:

> library(AnnotationHub)

> hub <- AnnotationHub()

> query(hub, c("mellifera","sqlite"))
AnnotationHub with 5 records
# snapshotDate(): 2018-04-30
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, Inparanoid8
# $species: Apis mellifera, Apis mellifera_cerana, Apis mellifera_dorsata, A...
# $rdataclass: OrgDb, Inparanoid8Db
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH10452"]]'

            title                              
  AH10452 | hom.Apis_mellifera.inp8.sqlite     
  AH62534 | org.Apis_mellifera.eg.sqlite       
  AH62636 | org.Apis_mellifera_cerana.eg.sqlite
  AH62643 | org.Apis_mellifera_florea.eg.sqlite
  AH62690 | org.Apis_mellifera_dorsata.eg.sqlite

> zz <- hub[["AH62534"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

> zz
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Apis mellifera
| SPECIES: Apis mellifera
| CENTRALID: GID
| Taxonomy ID: 7460
| Db type: OrgDb
| Supporting package: AnnotationDbi

Please see: help('select') for usage information
> columns(zz)
[1] "ACCNUM"      "ALIAS"       "CHR"         "ENTREZID"    "EVIDENCE"  
[6] "EVIDENCEALL" "GENENAME"    "GID"         "GO"          "GOALL"     
[11] "ONTOLOGY"    "ONTOLOGYALL" "PMID"        "REFSEQ"      "SYMBOL"    
[16] "UNIGENE"
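
To give a feel for what those mappings look like, here is a minimal sketch that pulls the symbol and gene name for the first few genes. It assumes record AH62534 is still present in your AnnotationHub snapshot (record IDs change between snapshots, so repeat the query if it isn't), and the variable names `ids` and `res` are just for illustration.

```r
library(AnnotationHub)
library(AnnotationDbi)

# Retrieve the pre-built honey bee OrgDb; the record ID is the one from
# the 2018-04-30 snapshot and may differ in newer snapshots.
hub <- AnnotationHub()
zz <- hub[["AH62534"]]

# GID is the central ID of this OrgDb, so use it as the key type.
ids <- head(keys(zz, keytype = "GID"))

# Call select() with its namespace: dplyr, if attached, masks it.
res <- AnnotationDbi::select(zz,
                             keys = ids,
                             columns = c("SYMBOL", "GENENAME"),
                             keytype = "GID")
res
```

Any of the names returned by columns(zz) can be used in the `columns` argument in the same way.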

The only thing this appears to be missing is the mapping from Entrez Gene ID to Ensembl ID, which IMO you shouldn't really be trying to do anyway, because there are any number of complicating factors in that sort of mapping. If you want to annotate using Ensembl, use metazoa.ensembl.org and the biomaRt package.
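
A rough sketch of the biomaRt route, under a few assumptions you should check for yourself: that the Metazoa mart is named "metazoa_mart", that the host needs the https:// prefix in your biomaRt version, and that the honey bee dataset can be found by searching the dataset descriptions for "mellifera". Mart and dataset names change between Ensembl releases, so look them up with listMarts() and listDatasets() rather than trusting the values here.

```r
library(biomaRt)

# Connect to Ensembl Metazoa rather than the main ensembl.org site.
# The mart name and host URL are assumptions; confirm with listMarts().
mart <- useMart(biomart = "metazoa_mart",
                host = "https://metazoa.ensembl.org")

# Look the dataset up by description instead of hard-coding its name.
ds <- listDatasets(mart)
bee_ds <- ds$dataset[grepl("mellifera", ds$description, ignore.case = TRUE)]
stopifnot(length(bee_ds) >= 1)
bee <- useDataset(as.character(bee_ds[1]), mart = mart)

# See which gene-related attributes this dataset offers before querying.
attrs <- listAttributes(bee)
head(attrs[grepl("gene", attrs$name), ])

# Example query: all Ensembl gene IDs with their descriptions.
genes <- getBM(attributes = c("ensembl_gene_id", "description"),
               mart = bee)
head(genes)
```

If more than one dataset matches "mellifera", inspect `bee_ds` and pick the one you want instead of taking the first match.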