Question

Retrieve gene description with ensembldb

@antonio-miguel-de-jesus-domingues-5182

I have been slowly replacing biomaRt queries with the annotations in ensembldb, mainly to retrieve gene symbols and gene locations by querying with Ensembl IDs. One thing I could not find is how to retrieve gene descriptions. Here is a basic example of how to do it with biomaRt:

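library(biomaRt)
# Connect to the Ensembl BioMart human gene dataset
ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# Gene symbols to look up
IDs <- c("BRCA2", "BRAF")
# Retrieve the symbol and description for each gene
genedesc <- getBM(attributes = c("external_gene_name", "description"), filters = "external_gene_name", values = IDs, mart = ensembl)
genedesc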

Is there a way of doing this with ensembldb? Or is there any alternative other than querying BioMart?

Thank you.

biomaRt ensembldb AnnotationData • 2.5k views

Updated 3.9 years ago by Johannes Rainer • written 3.9 years ago by António Miguel de Jesus Domingues

score 2 · Accepted Answer · 2021-04-09

Sure, that information is also available in ensembldb's EnsDb databases. Ideally, you should get these databases from AnnotationHub, as shown in the example below; that way you can get the EnsDb database for each species and for any Ensembl release.

First, we get the EnsDb for Homo sapiens and Ensembl release 100:

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2020-10-27
> query(ah, "EnsDb.Hsapiens.v100")
AnnotationHub with 1 record
# snapshotDate(): 2020-10-27
# names(): AH79689
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2020-04-27
# $title: Ensembl 100 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("100", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
#   "Protein", "Transcript") 
# retrieve record with 'object[["AH79689"]]' 
> edb <- ah[["AH79689"]]
loading from cache

You can then get gene annotations using the genes method:

> genes(edb)
GRanges object with 68008 ranges and 8 metadata columns:
                  seqnames            ranges strand |         gene_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENSG00000223972        1       11869-14409      + | ENSG00000223972
  ENSG00000227232        1       14404-29570      - | ENSG00000227232
              ...      ...               ...    ... .             ...
  ENSG00000231514        Y 26626520-26627159      - | ENSG00000231514
  ENSG00000235857        Y 56855244-56855488      + | ENSG00000235857
                    gene_name           gene_biotype seq_coord_system
                  <character>            <character>      <character>
  ENSG00000223972     DDX11L1 transcribed_unproces..       chromosome
  ENSG00000227232      WASH7P unprocessed_pseudogene       chromosome
              ...         ...                    ...              ...
  ENSG00000231514      CCNQP2   processed_pseudogene       chromosome
  ENSG00000235857     CTBP2P1   processed_pseudogene       chromosome
                             description   gene_id_version      symbol entrezid
                             <character>       <character> <character>   <list>
  ENSG00000223972 DEAD/H-box helicase .. ENSG00000223972.5     DDX11L1     <NA>
  ENSG00000227232 WASP family homolog .. ENSG00000227232.5      WASH7P     <NA>
              ...                    ...               ...         ...      ...
  ENSG00000231514 CCNQ pseudogene 2 [S.. ENSG00000231514.1      CCNQP2     <NA>
  ENSG00000235857 CTBP2 pseudogene 1 [.. ENSG00000235857.1     CTBP2P1     <NA>
  -------
  seqinfo: 454 sequences from GRCh38 genome

The gene description is available in the metadata column "description". Note also that you can retrieve the results as a data.frame by setting the parameter return.type = "data.frame".
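
To mirror the biomaRt query from the question, the call to genes can be restricted to the genes of interest and to the columns needed. The following is a minimal sketch, not a definitive recipe. It assumes the same AnnotationHub record as above (AH79689), that AnnotationHub can reach its cache or the network, and it uses the filter classes GeneNameFilter and GeneIdFilter from the AnnotationFilter package, which is attached together with ensembldb. The variable names genedesc and genedesc_by_id are only illustrative.

```r
library(AnnotationHub)
library(ensembldb)

# Load the Ensembl 100 EnsDb for Homo sapiens (record AH79689, as above).
ah <- AnnotationHub()
edb <- ah[["AH79689"]]

# Query by gene symbol: keep only the genes of interest, request only the
# columns needed, and return a data.frame instead of a GRanges object.
genedesc <- genes(
    edb,
    filter = GeneNameFilter(c("BRCA2", "BRAF")),
    columns = c("gene_id", "gene_name", "description"),
    return.type = "data.frame"
)
genedesc

# Query by Ensembl gene ID instead of symbol
# (ENSG00000139618 is BRCA2, ENSG00000157764 is BRAF).
genedesc_by_id <- genes(
    edb,
    filter = GeneIdFilter(c("ENSG00000139618", "ENSG00000157764")),
    columns = c("gene_id", "gene_name", "description"),
    return.type = "data.frame"
)
genedesc_by_id
```

Filtering inside the database query avoids loading all 68008 genes first and subsetting afterwards. A gene symbol can map to more than one Ensembl gene ID, so the symbol-based result may contain more rows than symbols supplied.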