Question

Best alternative annotation package to BiomaRt

0

Entering edit mode

m93 ▴ 10

@m93-21227

Last seen 5.7 years ago

I have developing an RNA-seq pipeline and one of the steps of this is to annotation my genes. I am using Ensembl IDs and I would like my query to return me "ensemblgeneid", "hgncid", "hgncsymbol","description". Indeed, I need this information for later performing Gene set enrichment analysis.

I have been having problems with BiomaRT being slow and timing out, as described in this post: https://support.bioconductor.org/p/122412/#122533

So I am thinking of changing to a different annotation package. The appear to be many out there: https://www.bioconductor.org/packages/release/data/annotation/

I need a package that would have b38 data for humans and would have "ensemblgeneid", "hgncid", "hgncsymbol","description" information. I have found the following (below)

Package  Maintainer  Title
EnsDb.Hsapiens.v75  Johannes Rainer     Ensembl based annotation package
EnsDb.Hsapiens.v79  Johannes Rainer     Ensembl based annotation package
EnsDb.Hsapiens.v86  Johannes Rainer     Ensembl based annotation package

However, I am concerned these packages may be outdated. Would anyone recommend a particular up to date package to replace BiomaRt? Has anyone had any experience similar to this?

Thanks.

R biomaRt Ensembl rnaseq • 4.1k views

ADD COMMENT • link 5.7 years ago m93 ▴ 10

2

Entering edit mode

biomaRt is supreme for annotation. If you have issues with your connection, then you could obtain an entire table via biomaRt on a once off basis, and then save and date-stamp it. You can then re-use that for annotating. This may actually be better because then you have version control in place.

For example, with this code, you can obtain a table that links Affy U133 probe IDs to ENSG ID, Biotype, and official gene symbols:

require("biomaRt")
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(
  mart = mart,
  attributes = c(
    "affy_hg_u133_plus_2",
    "ensembl_gene_id",
    "gene_biotype",
    "external_gene_name"),
  uniqueRows=TRUE)

dim(annotLookup)
[1] 107190      4
head(annotLookup)
  affy_hg_u133_plus_2 ensembl_gene_id   gene_biotype external_gene_name
1                     ENSG00000210049        Mt_tRNA              MT-TF
2                     ENSG00000211459        Mt_rRNA            MT-RNR1
3                     ENSG00000210077        Mt_tRNA              MT-TV
4                     ENSG00000210082        Mt_rRNA            MT-RNR2
5                     ENSG00000209082        Mt_tRNA             MT-TL1
6        1553551_s_at ENSG00000198888 protein_coding             MT-ND1

ADD REPLY • link 5.7 years ago Kevin Blighe ★ 4.0k

3

Entering edit mode

BiocFileCache might help implement this strategy, e.g., the use case 2.1 Local cache of an internet resource.

ADD REPLY • link 5.7 years ago Martin Morgan 25k

1

Entering edit mode

The caching implemented in biomaRt devel makes use of BiocFileCache to store results tables. It's really useful.

ADD REPLY • link 5.7 years ago Mike Smith ★ 6.6k

1

Entering edit mode

Thanks, I ended up using your solution and indeed it is much much faster to not have a specific query. Thanks!

ADD REPLY • link 5.7 years ago m93 ▴ 10

0

Entering edit mode

That's weird, somehow this works now. Before, O was querying a specific subset with the getBM() function but when I didn't query a specific subset like above, it worked somehow..

But I am no longer getting the usual message: Batch submitting query [=======>-----------------------------------------------------]

Is that normal? I'm confused as to why this suddenly works..

ADD REPLY • link 5.7 years ago m93 ▴ 10

1

Entering edit mode

You don't get the "batch submission" query because the batches are defined by the values you provide. If you don't provide this argument then it can't be split into batches.

Ensembl recommend a maximum of 500 values for each filter, otherwise queries can sometimes silently timeout on the server side and you get a truncated result table with no indication that this happened. There are examples of this if you search this forum, hence why the batch submission was introduced in the first place.

ADD REPLY • link 5.7 years ago Mike Smith ★ 6.6k

1

Entering edit mode

You can give the newest version of biomaRt a try, which will cache results for next time you run them, and store the temporary results for resuming a query that fails. You can install with

BiocManager::install("grimbough/biomaRt")

Bear in mind that these features are very new, so if it doesn't work properly please let me know here or even better raise an issue on GitHub.

ADD REPLY • link 5.7 years ago Mike Smith ★ 6.6k

1

Entering edit mode

You are right, the EnsDb databases that I provide as a package are outdated. But you can get EnsDb databases from AnnotationHub for all Ensembl releases from version 87 on:

> library(AnnotationHub)
> library(ensembldb)
> ah <- AnnotationHub()
snapshotDate(): 2019-05-02
> query(ah, "Hsapiens.v96")
AnnotationHub with 1 record
# snapshotDate(): 2019-05-02 
# names(): AH69187
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2019-04-15
# $title: Ensembl 96 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("96", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
#   "Protein", "Transcript") 
# retrieve record with 'object[["AH69187"]]' 
> edb <- ah[["AH69187"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache 
    'AH69187 : 75933'
> genes(edb)
GRanges object with 65868 ranges and 8 metadata columns:
                  seqnames            ranges strand |         gene_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENSG00000223972        1       11869-14409      + | ENSG00000223972
  ENSG00000227232        1       14404-29570      - | ENSG00000227232
              ...      ...               ...    ... .             ...
  ENSG00000231514        Y 26626520-26627159      - | ENSG00000231514
  ENSG00000235857        Y 56855244-56855488      + | ENSG00000235857
                    gene_name                       gene_biotype
                  <character>                        <character>
  ENSG00000223972     DDX11L1 transcribed_unprocessed_pseudogene
  ENSG00000227232      WASH7P             unprocessed_pseudogene
              ...         ...                                ...
  ENSG00000231514      CCNQP2               processed_pseudogene
  ENSG00000235857     CTBP2P1               processed_pseudogene
                  seq_coord_system
                       <character>
  ENSG00000223972       chromosome
  ENSG00000227232       chromosome
              ...              ...
  ENSG00000231514       chromosome
  ENSG00000235857       chromosome
                                                                                    description
                                                                                    <character>
  ENSG00000223972             DEAD/H-box helicase 11 like 1 [Source:HGNC Symbol;Acc:HGNC:37102]
  ENSG00000227232  WAS protein family homolog 7, pseudogene [Source:HGNC Symbol;Acc:HGNC:38034]
              ...                                                                           ...
  ENSG00000231514                         CCNQ pseudogene 2 [Source:HGNC Symbol;Acc:HGNC:38436]
  ENSG00000235857 C-terminal binding protein 2 pseudogene 1 [Source:HGNC Symbol;Acc:HGNC:23940]
                    gene_id_version      symbol entrezid
                        <character> <character>   <list>
  ENSG00000223972 ENSG00000223972.5     DDX11L1       NA
  ENSG00000227232 ENSG00000227232.5      WASH7P       NA
              ...               ...         ...      ...
  ENSG00000231514 ENSG00000231514.1      CCNQP2       NA
  ENSG00000235857 ENSG00000235857.1     CTBP2P1       NA
  -------
  seqinfo: 423 sequences from GRCh38 genome

So, there is the "gene_id" field (Ensembl gene ID), "symbol" (HGNC Symbol), "description" and "entrezid" - I don't provide the HGNC ID directly, but you could extract that from the description.

To get an overview of all EnsDb databases that are available:

> query(ah, "EnsDb")
AnnotationHub with 974 records
# snapshotDate(): 2019-05-02 
# $dataprovider: Ensembl
# $species: Ailuropoda melanoleuca, Anolis carolinensis, Astyanax mexicanus,...
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53185"]]' 

            title                                      
  AH53185 | Ensembl 87 EnsDb for Anolis Carolinensis   
  AH53186 | Ensembl 87 EnsDb for Ailuropoda Melanoleuca
  ...       ...                                        
  AH69288 | Ensembl 96 EnsDb for Xenopus tropicalis    
  AH69289 | Ensembl 96 EnsDb for Zonotrichia albicollis

ADD REPLY • link 5.7 years ago Johannes Rainer ★ 2.1k