Question

get_gene_transcript_exon_tables.pl extremely slow

Ed Siefker ▴ 230

@ed-siefker-5136

United States

I am trying to build an up-to-date EnsDb following the vignette. I have the Ensembl Perl API installed. fetchTablesFromEnsembl() is running, but extremely slowly. After about 2 hours, I have only about 3 MB of text files.

> fetchTablesFromEnsembl(90, species = "mouse")
Connecting to ensembldb.ensembl.org at port 5306
# get_gene_transcript_exon_tables.pl version 0.3.0:
Retrieve gene models for Ensembl version 90, species mouse from Ensembl database at host: ensembldb.ensembl.org
Start fetching data

$ du -shc *.txt
512B    ens_chromosome.txt
15K    ens_entrezgene.txt
611K    ens_exon.txt
36K    ens_gene.txt
860K    ens_protein.txt
477K    ens_protein_domain.txt
249K    ens_tx.txt
604K    ens_tx2exon.txt
79K    ens_uniprot.txt
2.9M    total

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
76085 esiefker 1 20 0 139M 76464K sbwait 0 1:28 0.91% perl

Perl is spending all its time in 'sbwait' (this is FreeBSD 11), that is, waiting on socket data. Any ideas on how to improve this?

> sessionInfo()

Would be included but the forum is telling me:

"Language "af" is not one of the supported languages ['en']!"

ensembl ensembldb • 1.2k views

Updated 6.5 years ago by Johannes Rainer ★ 2.0k • written 6.5 years ago by Ed Siefker ▴ 230

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: amd64-portbld-freebsd11.0 (64-bit)
Running under: FreeBSD bio 11.0-STABLE FreeBSD 11.0-STABLE #0 r321665+25fe8ba8d06(freenas/11.0-stable): Mon Sep 25 06:24:11 UTC 2017 root@gauntlet:/freenas-11-releng/freenas/_BE/objs/freenas-11-releng/freenas/_BE/os/sys/FreeNAS.amd64 amd64

Matrix products: default
LAPACK: /usr/local/lib/R/lib/libRlapack.so

locale:
[1] C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] ensembldb_2.0.4        AnnotationFilter_1.0.0 GenomicFeatures_1.28.5
[4] AnnotationDbi_1.38.2   Biobase_2.36.2         GenomicRanges_1.28.6
[7] GenomeInfoDb_1.12.3    IRanges_2.10.5         S4Vectors_0.14.7
[10] BiocGenerics_0.22.1

loaded via a namespace (and not attached):
[1] Rcpp_0.12.13                  BiocInstaller_1.26.1
[3] AnnotationHub_2.8.3           compiler_3.4.2
[5] XVector_0.16.0                ProtGenerics_1.8.0
[7] bitops_1.0-6                  tools_3.4.2
[9] zlibbioc_1.22.0               biomaRt_2.32.1
[11] digest_0.6.12                 bit_1.1-12
[13] RSQLite_2.0                   memoise_1.1.0
[15] tibble_1.3.4                  lattice_0.20-35
[17] rlang_0.1.2                   Matrix_1.2-11
[19] shiny_1.0.5                   DelayedArray_0.2.7
[21] DBI_0.7                       curl_3.0
[23] yaml_2.1.14                   GenomeInfoDbData_0.99.0
[25] httr_1.3.1                    rtracklayer_1.36.6
[27] Biostrings_2.44.2             bit64_0.9-7
[29] grid_3.4.2                    R6_2.2.2
[31] XML_3.98-1.9                  BiocParallel_1.10.1
[33] blob_1.1.0                    htmltools_0.3.6
[35] Rsamtools_1.28.0              matrixStats_0.52.2
[37] GenomicAlignments_1.12.2      SummarizedExperiment_1.6.5
[39] xtable_1.8-2                  mime_0.5
[41] interactiveDisplayBase_1.14.0 httpuv_1.3.5
[43] RCurl_1.95-4.8                lazyeval_0.2.0
>

Reply • 6.5 years ago by Ed Siefker ▴ 230

score 0 · Answer 1 · 2017-11-02

Hi Ed,

Using the Perl scripts and the Ensembl Perl API is indeed very slow, and it is even worse if you fetch the data from the main Ensembl MySQL servers. With each new Ensembl release, I first download the MySQL database dumps from Ensembl and install them locally. This improves performance, but it still takes several hours to generate the full database.

The good news: you don't need to build the package yourself. With each new Ensembl release I build EnsDb databases for each species and add them to AnnotationHub. You will need Bioconductor version 3.6 (just released); with it you can download and use the Ensembl 90 EnsDb for mouse from there:

> library(AnnotationHub)
> query(AnnotationHub(), "EnsDb.Mmusculus")
snapshotDate(): 2017-10-27
AnnotationHub with 4 records
# snapshotDate(): 2017-10-27
# $dataprovider: Ensembl
# $species: Mus Musculus
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53222"]]'

            title                            
  AH53222 | Ensembl 87 EnsDb for Mus Musculus
  AH53726 | Ensembl 88 EnsDb for Mus Musculus
  AH56691 | Ensembl 89 EnsDb for Mus Musculus
  AH57770 | Ensembl 90 EnsDb for Mus Musculus
> edb <- AnnotationHub()[["AH57770"]]
snapshotDate(): 2017-10-27
require(“ensembldb”)
downloading from 'https://annotationhub.bioconductor.org/fetch/64508'
retrieving 1 resource
  |======================================================================| 100%
> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.1
|Creation time: Sat Aug 26 22:06:14 2017
|ensembl_version: 90
|ensembl_host: localhost
|Organism: mus_musculus
|taxonomy_id: 10090
|genome_build: GRCm38
|DBSCHEMAVERSION: 2.1
| No. of genes: 53110.
| No. of transcripts: 132305.
|Protein data available.

If you still want or need to build it on your own, you will have to be patient: it is very slow.
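For a build against locally installed MySQL dumps, as described above, something along these lines might work. This is only a sketch: it assumes the Ensembl 90 mouse core database has already been loaded into a MySQL server on the same machine and that the Ensembl Perl API is installed. The wrapper name build_local_ensdb, the user name "ensembl_ro", the empty password, and port 3306 are hypothetical placeholders, not values from this thread.

```r
# Sketch only: assumes a local MySQL server already holds the Ensembl core
# database for the requested release, and that the Ensembl Perl API is installed.
# build_local_ensdb, "ensembl_ro", the empty password, and port 3306 are
# hypothetical placeholders; adjust them to your own setup.
build_local_ensdb <- function(version = 90, species = "mouse",
                              host = "localhost", port = 3306,
                              user = "ensembl_ro", pass = "") {
    if (!requireNamespace("ensembldb", quietly = TRUE))
        stop("the ensembldb package is required")
    # Step 1: run the Perl script against the local server instead of
    # ensembldb.ensembl.org; this writes the ens_*.txt tables to the
    # current working directory.
    ensembldb::fetchTablesFromEnsembl(version, species = species,
                                      host = host, port = port,
                                      user = user, pass = pass)
    # Step 2: build the SQLite database from those tables and return
    # the name of the database file.
    ensembldb::makeEnsemblSQLiteFromTables(path = ".")
}
```

Calling build_local_ensdb() should return the name of the SQLite file, which can then be opened with EnsDb(). Even against a local server, expect the fetch step to run for hours.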

Hope this helps.

cheers, jo