makeOrgPackageFromNCBI - cannot connect to BioMart
lukeholman • 0
@lukeholman-16693

Hey there,

I'm having some trouble getting makeOrgPackageFromNCBI() to work. As you can see from the output below, it downloads the data from NCBI just fine, then fails when trying to access BioMart. When I open the BioMart link from the error message in a browser, I just get a blank webpage with the text 0.7 on it, the same as when I tried a week ago. Any ideas? Based on other threads about similar errors, the problem is either that BioMart is down all the time, or that my university has a firewall (I've never had trouble with anything else though). The sessionInfo() output is below.

Also, can someone please tell me what is going to be in the OrgDb once I finally make it? It's unclear why the message thinks the database will be exactly 12 GB - surely that depends on how much info there is for the organism in question? I checked the help files (e.g. ?`OrgDb-class`), and they don't seem to say what's in the db.

Cheers!

Luke

 

AnnotationForge::makeOrgPackageFromNCBI(
  version="0.1",
  maintainer = "Luke",
  author = "Luke",
  outputDir=getwd(),
  tax_id = "7460",
  genus="Apis",
  species="mellifera",
  NCBIFilesDir=getwd(),
  databaseOnly=FALSE,
  useDeprecatedStyle=FALSE,
  rebuildCache=TRUE,
  verbose=TRUE)

If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene2unigene
[5] gene_info.gz
[6] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene2unigene
rebuilding the cache
extracting data for our organism from : gene2unigene
getting all data for our organism from : gene2unigene
getting data for gene_info.gz
rebuilding the cache
extracting data for our organism from : gene_info
getting data for gene2go.gz
rebuilding the cache
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Please be patient while we work out which organisms can be annotated with ensembl IDs.
Request to BioMart web service failed.
The BioMart web service you're accessing may be down.
Check the following URL and see if this website is available:
http://www.ensembl.org:80/biomart/martservice?type=version&requestid=biomaRt&mart=ENSEMBL_MART_ENSEMBL
Error in if (BioMartVersion == "\n" | BioMartVersion == "") { : 
  argument is of length zero
In addition: Warning messages:
1: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
2: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
3: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
4: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
5: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
6: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
7: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries
8: In result_fetch(res@ptr, n = n) :
  Don't need to call dbFetch() for statements, only for queries

 

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scales_0.5.0          bindrcpp_0.2.2        clusterProfiler_3.8.1 kableExtra_0.9.0      pander_0.6.2          sva_3.28.0            BiocParallel_1.14.2   genefilter_1.62.0    
 [9] mgcv_1.8-24           nlme_3.1-137          MuMIn_1.42.1          ecodist_2.0.1         gplots_3.0.1          ggjoy_0.4.1           ggridges_0.5.0        RColorBrewer_1.1-2   
[17] gridExtra_2.3         ggdendro_0.1-20       ggrepel_0.8.0         ggplot2_3.0.0         stringr_1.3.1         tidyr_0.8.1           dplyr_0.7.6           reshape2_1.4.3       
[25] RSQLite_2.1.1         WGCNA_1.63            fastcluster_1.1.25    dynamicTreeCut_1.63-1 GOstats_2.46.0        Category_2.46.0       Matrix_1.2-14         biomaRt_2.36.1       
[33] GSEABase_1.42.0       graph_1.58.0          annotate_1.58.0       XML_3.98-1.12         AnnotationDbi_1.42.1  IRanges_2.14.10       S4Vectors_0.18.3      Biobase_2.40.0       
[41] BiocGenerics_0.26.0  

loaded via a namespace (and not attached):
  [1] backports_1.1.2        Hmisc_4.1-1            fastmatch_1.1-0        plyr_1.8.4             igraph_1.2.1           lazyeval_0.2.1         splines_3.5.1          GenomeInfoDb_1.16.0   
  [9] robust_0.4-18          digest_0.6.15          foreach_1.4.4          htmltools_0.3.6        GOSemSim_2.6.0         viridis_0.5.1          GO.db_3.6.0            fansi_0.2.3           
 [17] gdata_2.18.0           magrittr_1.5           checkmate_1.8.5        memoise_1.1.0          fit.models_0.5-14      cluster_2.0.7-1        doParallel_1.0.11      limma_3.36.2          
 [25] readr_1.1.1            matrixStats_0.54.0     enrichplot_1.0.2       prettyunits_1.0.2      colorspace_1.3-2       rvest_0.3.2            blob_1.1.1             rrcov_1.4-4           
 [33] crayon_1.3.4           RCurl_1.95-4.11        bindr_0.1.1            impute_1.54.0          survival_2.42-6        iterators_1.0.10       glue_1.3.0             gtable_0.2.0          
 [41] UpSetR_1.3.3           Rgraphviz_2.24.0       DEoptimR_1.0-8         DOSE_3.6.1             mvtnorm_1.0-8          DBI_1.0.0              Rcpp_0.12.18           viridisLite_0.3.0     
 [49] xtable_1.8-2           progress_1.2.0         htmlTable_1.12         units_0.6-0            foreign_0.8-71         bit_1.1-14             preprocessCore_1.42.0  Formula_1.2-3         
 [57] AnnotationForge_1.22.1 htmlwidgets_1.2        httr_1.3.1             fgsea_1.6.0            acepack_1.4.1          pkgconfig_2.0.1        nnet_7.3-12            dbplyr_1.2.2          
 [65] utf8_1.1.4             labeling_0.3           tidyselect_0.2.4       rlang_0.2.1            munsell_0.5.0          tools_3.5.1            cli_1.0.0              evaluate_0.11         
 [73] yaml_2.2.0             knitr_1.20             bit64_0.9-7            robustbase_0.93-1.1    caTools_1.17.1.1       purrr_0.2.5            ggraph_1.0.2           RBGL_1.56.0           
 [81] xml2_1.2.0             DO.db_2.9              compiler_3.5.1         rstudioapi_0.7         tibble_1.4.2           tweenr_0.1.5           pcaPP_1.9-73           stringi_1.2.4         
 [89] highr_0.7              lattice_0.20-35        pillar_1.3.0           data.table_1.11.4      cowplot_0.9.3          bitops_1.0-6           qvalue_2.12.0          R6_2.2.2              
 [97] latticeExtra_0.6-28    KernSmooth_2.23-15     codetools_0.2-15       MASS_7.3-50            gtools_3.8.1           assertthat_0.2.0       rprojroot_1.3-2        withr_2.1.2           
[105] GenomeInfoDbData_1.1.0 hms_0.4.2              rpart_4.1-13           rmarkdown_1.10         rvcheck_0.1.0          ggforce_0.1.3          base64enc_0.1-3       

annotationforge • 1.3k views
@james-w-macdonald-5106
United States

The message about a 12 GB cache database doesn't pertain to your OrgDb package - it's an informative message about all the data that are going to be downloaded from NCBI in order to generate the OrgDb package.

The OrgDb package will contain mappings between the Entrez Gene IDs for A. mellifera and IDs from various other databases (gene symbol, gene name, GO terms, etc). See the AnnotationDbi vignette for more information about OrgDb packages.

As for the BioMart query, the problem is most likely that the query is going to ensembl.org when it should go to metazoa.ensembl.org. Hypothetically the function should know to use the correct Ensembl host, depending on the taxonomy ID, but that might be pretty tricky, and it's not clear if there is any profit in trying to get it to work.

I say that because there is already a pre-built annotation package for this species:

> library(AnnotationHub)

> hub <- AnnotationHub()

> query(hub, c("mellifera","sqlite"))
AnnotationHub with 5 records
# snapshotDate(): 2018-04-30
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, Inparanoid8
# $species: Apis mellifera, Apis mellifera_cerana, Apis mellifera_dorsata, A...
# $rdataclass: OrgDb, Inparanoid8Db
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH10452"]]'

            title                              
  AH10452 | hom.Apis_mellifera.inp8.sqlite     
  AH62534 | org.Apis_mellifera.eg.sqlite       
  AH62636 | org.Apis_mellifera_cerana.eg.sqlite
  AH62643 | org.Apis_mellifera_florea.eg.sqlite
  AH62690 | org.Apis_mellifera_dorsata.eg.sqlite

> zz <- hub[["AH62534"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

> zz
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Apis mellifera
| SPECIES: Apis mellifera
| CENTRALID: GID
| Taxonomy ID: 7460
| Db type: OrgDb
| Supporting package: AnnotationDbi

Please see: help('select') for usage information
> columns(zz)
[1] "ACCNUM"      "ALIAS"       "CHR"         "ENTREZID"    "EVIDENCE"  
[6] "EVIDENCEALL" "GENENAME"    "GID"         "GO"          "GOALL"     
[11] "ONTOLOGY"    "ONTOLOGYALL" "PMID"        "REFSEQ"      "SYMBOL"    
[16] "UNIGENE"   
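
To pull mappings out of an object like that, something along these lines should work. This is only a sketch: describe_genes() is a hypothetical wrapper name I made up, and it assumes that "ENTREZID" is a valid keytype for the object (check with keytypes(zz)).

```r
# Hypothetical wrapper (not part of AnnotationDbi): look up the gene symbol
# and gene name for a set of Entrez Gene IDs.
# `db` is any OrgDb object, e.g. the `zz` object retrieved from AnnotationHub.
describe_genes <- function(db, ids) {
  AnnotationDbi::select(
    db,
    keys    = as.character(ids),        # OrgDb keys are character strings
    columns = c("SYMBOL", "GENENAME"),  # both appear in columns(zz)
    keytype = "ENTREZID"                # assumption: confirm with keytypes(db)
  )
}
```

With the object from above you could then run describe_genes(zz, head(keys(zz, keytype = "ENTREZID"))) to get a data.frame with one row per ID and symbol/name combination.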

The only thing this package appears to be missing is the mapping from Entrez Gene ID to Ensembl ID, which IMO you shouldn't really be trying to do anyway, because there are any number of complicating factors to that sort of mapping. If you want to annotate using Ensembl, use metazoa.ensembl.org and the biomaRt package.
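
A minimal sketch of that biomaRt route follows. The function name bee_ensembl_genes() is hypothetical, and the dataset name ("amellifera_eg_gene") and attribute names are assumptions on my part, so confirm them with listDatasets() and listAttributes() on the mart before relying on them. Depending on your biomaRt version, the host may need to be given with or without the https:// prefix.

```r
# Sketch: fetch Ensembl gene IDs for A. mellifera from Ensembl Metazoa
# instead of the main ensembl.org BioMart. Requires a network connection.
bee_ensembl_genes <- function(host = "https://metazoa.ensembl.org",
                              dataset = "amellifera_eg_gene",  # assumed name
                              attributes = c("ensembl_gene_id", "description")) {
  # Connect to the metazoa mart on the Ensembl Genomes host
  mart <- biomaRt::useMart(biomart = "metazoa_mart",
                           dataset = dataset,
                           host = host)
  # No filters are given, so this returns every gene in the dataset
  biomaRt::getBM(attributes = attributes, mart = mart)
}
```

Calling bee_ensembl_genes() should return a data.frame with one row per gene, provided the assumed dataset and attribute names are valid for the current Ensembl Metazoa release.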

 


Great, thanks for the help! I'll have a look into it. Some ideas to improve the help file for makeOrgPackageFromNCBI:

- Explain that it is possible to search for available annotation packages using a separate package (AnnotationHub). I thought there were only 20 pre-built packages for the 'classic' organisms (human, mouse, fly, zebrafish, etc...), because I read that somewhere. If there are secretly dozens more pre-built ones for all sorts of random species, that'd be good to mention (perhaps in the vignette for this function: https://bioconductor.org/packages/devel/bioc/vignettes/AnnotationForge/inst/doc/MakingNewOrganismPackages.html)

- Add a more informative error message, that explains what you just explained to me (e.g. "Maybe the function is choosing the wrong default, try manually specifying metazoa.ensembl.org or whatever"). You could also add an optional argument to the function to specify which type of organism it is, hiding the internal workings from users who just want to get their database working. 
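
To make the second suggestion concrete, here is a rough sketch of the kind of lookup I have in mind. Both functions are hypothetical and not part of AnnotationForge or biomaRt; the host and mart names are my assumption about how the Ensembl Genomes divisions are laid out, and the URL pattern just mimics the one in the error message above.

```r
# Hypothetical helper: map an Ensembl Genomes division to the BioMart
# host and mart name that a query for that kind of organism would use.
ensembl_genomes_mart <- function(division) {
  marts <- c(metazoa  = "metazoa_mart",
             plants   = "plants_mart",
             fungi    = "fungi_mart",
             protists = "protists_mart")
  # Stops with an error if the division is not one of the known names
  division <- match.arg(division, names(marts))
  list(host = paste0("https://", division, ".ensembl.org"),
       mart = unname(marts[division]))
}

# Build a version-check URL of the same form as the one in the error message,
# which can be opened in a browser to see whether the service responds.
version_url <- function(division) {
  m <- ensembl_genomes_mart(division)
  paste0(m$host, "/biomart/martservice?type=version&mart=", m$mart)
}

version_url("metazoa")
```

An optional argument like division = "metazoa" on makeOrgPackageFromNCBI() could then be passed through to a lookup like this, so users never need to know the host names themselves.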

 
