get_gene_transcript_exon_tables.pl extremely slow
Ed Siefker ▴ 230
@ed-siefker-5136
Last seen 5 months ago
United States

I am trying to build an up-to-date EnsDb following the vignette. I have the Ensembl Perl API installed. fetchTablesFromEnsembl() is running, but extremely slowly: after about 2 hours, I have only about 3 MB of text files.

> fetchTablesFromEnsembl(90, species = "mouse")
Connecting to ensembldb.ensembl.org at port 5306
# get_gene_transcript_exon_tables.pl version 0.3.0:
Retrieve gene models for Ensembl version 90, species mouse from Ensembl database at host: ensembldb.ensembl.org
Start fetching data

$ du -shc *.txt
512B    ens_chromosome.txt
 15K    ens_entrezgene.txt
611K    ens_exon.txt
 36K    ens_gene.txt
860K    ens_protein.txt
477K    ens_protein_domain.txt
249K    ens_tx.txt
604K    ens_tx2exon.txt
 79K    ens_uniprot.txt
2.9M    total

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
76085 esiefker      1  20    0   139M 76464K sbwait  0   1:28   0.91% perl

Perl is spending all its time in the 'sbwait' state (FreeBSD 11), i.e. it is waiting for data on a network socket. Any ideas on how to improve this?

> sessionInfo()

I would include the output, but the forum is telling me:

"Language "af" is not one of the supported languages ['en']!"
 

ensembl ensembldb • 1.2k views

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: amd64-portbld-freebsd11.0 (64-bit)
Running under: FreeBSD bio 11.0-STABLE FreeBSD 11.0-STABLE #0 r321665+25fe8ba8d06(freenas/11.0-stable): Mon Sep 25 06:24:11 UTC 2017     root@gauntlet:/freenas-11-releng/freenas/_BE/objs/freenas-11-releng/freenas/_BE/os/sys/FreeNAS.amd64  amd64

Matrix products: default
LAPACK: /usr/local/lib/R/lib/libRlapack.so

locale:
[1] C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] ensembldb_2.0.4        AnnotationFilter_1.0.0 GenomicFeatures_1.28.5
 [4] AnnotationDbi_1.38.2   Biobase_2.36.2         GenomicRanges_1.28.6
 [7] GenomeInfoDb_1.12.3    IRanges_2.10.5         S4Vectors_0.14.7
[10] BiocGenerics_0.22.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13                  BiocInstaller_1.26.1
 [3] AnnotationHub_2.8.3           compiler_3.4.2
 [5] XVector_0.16.0                ProtGenerics_1.8.0
 [7] bitops_1.0-6                  tools_3.4.2
 [9] zlibbioc_1.22.0               biomaRt_2.32.1
[11] digest_0.6.12                 bit_1.1-12
[13] RSQLite_2.0                   memoise_1.1.0
[15] tibble_1.3.4                  lattice_0.20-35
[17] rlang_0.1.2                   Matrix_1.2-11
[19] shiny_1.0.5                   DelayedArray_0.2.7
[21] DBI_0.7                       curl_3.0
[23] yaml_2.1.14                   GenomeInfoDbData_0.99.0
[25] httr_1.3.1                    rtracklayer_1.36.6
[27] Biostrings_2.44.2             bit64_0.9-7
[29] grid_3.4.2                    R6_2.2.2
[31] XML_3.98-1.9                  BiocParallel_1.10.1
[33] blob_1.1.0                    htmltools_0.3.6
[35] Rsamtools_1.28.0              matrixStats_0.52.2
[37] GenomicAlignments_1.12.2      SummarizedExperiment_1.6.5
[39] xtable_1.8-2                  mime_0.5
[41] interactiveDisplayBase_1.14.0 httpuv_1.3.5
[43] RCurl_1.95-4.8                lazyeval_0.2.0
>

 

Johannes Rainer ★ 2.0k
@johannes-rainer-6987
Last seen 14 days ago
Italy

Hi Ed,

Using the Perl scripts and the Ensembl Perl API is indeed very slow, and it is even worse if you fetch the data from the main Ensembl MySQL servers. With each new Ensembl release, I first download the MySQL database dumps from Ensembl and install them locally. This improves performance, but it still takes several hours to generate the full database.
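As a rough illustration of the local-mirror approach described above, here is a minimal sketch, not a tested recipe. It assumes the Ensembl core database dump has already been loaded into a MySQL server on localhost, port 3306, with a read-only user named "ensro" (user name, port, and the helper name build_ensdb_from_mirror are all hypothetical), and that the Ensembl Perl API is on PERL5LIB. It only wraps the ensembldb functions fetchTablesFromEnsembl() and makeEnsemblSQLiteFromTables().

```r
## Sketch: build an EnsDb SQLite file from a *local* MySQL mirror instead of
## ensembldb.ensembl.org. Host, port, user and output directory are
## assumptions; adapt them to your own installation.
build_ensdb_from_mirror <- function(version = 90, species = "mouse",
                                    host = "localhost", port = 3306,
                                    user = "ensro", pass = "",
                                    outdir = tempfile("ensdb_")) {
    dir.create(outdir, recursive = TRUE, showWarnings = FALSE)
    ## Work inside outdir and restore the previous working directory on exit.
    old <- setwd(outdir)
    on.exit(setwd(old), add = TRUE)
    ## Runs get_gene_transcript_exon_tables.pl against the given host and
    ## writes the ens_*.txt tables into the working directory.
    ensembldb::fetchTablesFromEnsembl(version, species = species,
                                      host = host, port = port,
                                      user = user, pass = pass)
    ## Assemble the text tables found in outdir into an SQLite database.
    ensembldb::makeEnsemblSQLiteFromTables(path = outdir)
}
```

Calling build_ensdb_from_mirror() with the defaults would correspond to the fetchTablesFromEnsembl(90, species = "mouse") call in the question, only with the queries going to the local server rather than over the network.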

The good news: you don't need to build the package yourself. With each new Ensembl release I build EnsDb databases for each species and add them to AnnotationHub. You will need Bioconductor version 3.6 (just released); with it you can download and use the Ensembl 90 EnsDb for mouse from there:

> library(AnnotationHub)
> query(AnnotationHub(), "EnsDb.Mmusculus")
snapshotDate(): 2017-10-27
AnnotationHub with 4 records
# snapshotDate(): 2017-10-27
# $dataprovider: Ensembl
# $species: Mus Musculus
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53222"]]'

            title                            
  AH53222 | Ensembl 87 EnsDb for Mus Musculus
  AH53726 | Ensembl 88 EnsDb for Mus Musculus
  AH56691 | Ensembl 89 EnsDb for Mus Musculus
  AH57770 | Ensembl 90 EnsDb for Mus Musculus
> edb <- AnnotationHub()[["AH57770"]]
snapshotDate(): 2017-10-27
require(“ensembldb”)
downloading from 'https://annotationhub.bioconductor.org/fetch/64508'
retrieving 1 resource
  |======================================================================| 100%
> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.1
|Creation time: Sat Aug 26 22:06:14 2017
|ensembl_version: 90
|ensembl_host: localhost
|Organism: mus_musculus
|taxonomy_id: 10090
|genome_build: GRCm38
|DBSCHEMAVERSION: 2.1
| No. of genes: 53110.
| No. of transcripts: 132305.
|Protein data available.
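
Once retrieved, the object can be queried like any other EnsDb. A minimal sketch, assuming edb is the object loaded above; the helper name chrY_genes and the choice of chromosome "Y" are purely illustrative:

```r
## Sketch: extract all genes on one chromosome from an EnsDb object.
## `edb` is expected to be an EnsDb, e.g. the one fetched from AnnotationHub.
chrY_genes <- function(edb) {
    ## SeqNameFilter restricts the result to features on the named sequence;
    ## genes() returns a GRanges object by default.
    ensembldb::genes(edb, filter = AnnotationFilter::SeqNameFilter("Y"))
}
```

With the database from AnnotationHub this would be called as chrY_genes(edb).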

If you still want or need to build it on your own, you will have to be patient: it is very slow.

hope this helped

cheers, jo
