Question

EnsemblDb: how to retrieve gene name / description?

Guido Hooiveld ★ 4.0k

Wageningen University, Wageningen, the …

I have an Ensembl-based RNA-seq dataset that I would like to annotate using the package EnsDb.Mmusculus.v79. However, I cannot retrieve the gene name (description). Based on the available columns, I assumed it would be accessible through columns = "GENENAME", but that turns out not to be the case: it returns the same values as "SYMBOL". See the example code below. Am I missing something obvious? Any advice on how to do this would be appreciated.

> library(EnsDb.Mmusculus.v79)
>
> keys <- keys(EnsDb.Mmusculus.v79)[1:15]
> anno.result <- select(EnsDb.Mmusculus.v79, keys=keys, columns=c("GENEID","SYMBOL","GENENAME","ENTREZID"),keytype="GENEID")
>
> anno.result
               GENEID SYMBOL GENENAME ENTREZID
1  ENSMUSG00000000001  Gnai3    Gnai3    14679
2  ENSMUSG00000000003   Pbsn     Pbsn    54192
3  ENSMUSG00000000028  Cdc45    Cdc45    12544
4  ENSMUSG00000000031    H19      H19         
5  ENSMUSG00000000037  Scml2    Scml2   107815
6  ENSMUSG00000000049   Apoh     Apoh    11818
7  ENSMUSG00000000056   Narf     Narf    67608
8  ENSMUSG00000000058   Cav2     Cav2    12390
9  ENSMUSG00000000078   Klf6     Klf6    23849
10 ENSMUSG00000000085  Scmh1    Scmh1         
11 ENSMUSG00000000088  Cox5a    Cox5a    12858
12 ENSMUSG00000000093   Tbx2     Tbx2    21385
13 ENSMUSG00000000094   Tbx4     Tbx4    21387
14 ENSMUSG00000000103   Zfy2     Zfy2         
15 ENSMUSG00000000120   Ngfr     Ngfr    18053
>

For the first gene, Gnai3 (ENSMUSG00000000001), the description should be "guanine nucleotide binding protein (G protein), alpha inhibiting 3" according to the Ensembl website.

> columns(EnsDb.Mmusculus.v79)
 [1] "ENTREZID"            "EXONID"              "EXONIDX"            
 [4] "EXONSEQEND"          "EXONSEQSTART"        "GENEBIOTYPE"        
 [7] "GENEID"              "GENENAME"            "GENESEQEND"         
[10] "GENESEQSTART"        "INTERPROACCESSION"   "ISCIRCULAR"         
[13] "PROTDOMEND"          "PROTDOMSTART"        "PROTEINDOMAINID"    
[16] "PROTEINDOMAINSOURCE" "PROTEINID"           "PROTEINSEQUENCE"    
[19] "SEQCOORDSYSTEM"      "SEQLENGTH"           "SEQNAME"            
[22] "SEQSTRAND"           "SYMBOL"              "TXBIOTYPE"          
[25] "TXCDSSEQEND"         "TXCDSSEQSTART"       "TXID"               
[28] "TXNAME"              "TXSEQEND"            "TXSEQSTART"         
[31] "UNIPROTDB"           "UNIPROTID"           "UNIPROTMAPPINGTYPE"
>

> sessionInfo()
R version 3.4.0 Patched (2017-05-10 r72670)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] EnsDb.Mmusculus.v79_2.1.0 ensembldb_2.0.4          
 [3] AnnotationFilter_1.0.0    GenomicFeatures_1.28.4   
 [5] AnnotationDbi_1.38.2      Biobase_2.36.2           
 [7] GenomicRanges_1.28.4      GenomeInfoDb_1.12.2      
 [9] IRanges_2.10.2            S4Vectors_0.14.3         
[11] BiocGenerics_0.22.0      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12                  BiocInstaller_1.26.0         
 [3] AnnotationHub_2.8.2           compiler_3.4.0               
 [5] XVector_0.16.0                ProtGenerics_1.8.0           
 [7] bitops_1.0-6                  tools_3.4.0                  
 [9] zlibbioc_1.22.0               biomaRt_2.32.1               
[11] digest_0.6.12                 bit_1.1-12                   
[13] RSQLite_2.0                   memoise_1.1.0                
[15] tibble_1.3.3                  lattice_0.20-35              
[17] pkgconfig_2.0.1               rlang_0.1.2                  
[19] Matrix_1.2-10                 shiny_1.0.3                  
[21] DelayedArray_0.2.7            DBI_0.7                      
[23] curl_2.8.1                    yaml_2.1.14                  
[25] GenomeInfoDbData_0.99.0       httr_1.2.1                   
[27] rtracklayer_1.36.4            Biostrings_2.44.2            
[29] bit64_0.9-7                   grid_3.4.0                   
[31] R6_2.2.2                      XML_3.98-1.9                 
[33] BiocParallel_1.10.1           blob_1.1.0                   
[35] htmltools_0.3.6               Rsamtools_1.28.0             
[37] matrixStats_0.52.2            GenomicAlignments_1.12.1     
[39] SummarizedExperiment_1.6.3    xtable_1.8-2                 
[41] mime_0.5                      interactiveDisplayBase_1.14.0
[43] httpuv_1.3.5                  RCurl_1.95-4.8               
[45] lazyeval_0.2.0               
>

EnsDb.Mmusculus.v79 ensembldb annotation • 4.7k views

ADD COMMENT • link updated 7.0 years ago by Johannes Rainer ★ 2.1k • written 7.0 years ago by Guido Hooiveld ★ 4.0k

James W. MacDonald 66k

United States

To my knowledge you aren't missing anything obvious: the full gene names (descriptions) aren't part of the EnsDb. You can use biomaRt to get them.
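
A minimal sketch of the biomaRt route, for illustration only. It assumes the biomaRt package is installed and the Ensembl BioMart service is reachable; fetch_gene_descriptions() is a hypothetical helper name, not part of any package. Note that this queries the current Ensembl release, so the results may not match Ensembl 79 exactly.

```r
# Hypothetical helper (not part of biomaRt): look up symbol and description
# for a vector of Ensembl mouse gene IDs via the Ensembl BioMart.
fetch_gene_descriptions <- function(gene_ids) {
  if (!requireNamespace("biomaRt", quietly = TRUE)) {
    stop("The biomaRt package is required for this sketch.")
  }
  # Connect to the mouse gene dataset of the current Ensembl release.
  mart <- biomaRt::useMart("ensembl", dataset = "mmusculus_gene_ensembl")
  # Request the gene symbol and the free-text description per gene ID.
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "external_gene_name", "description"),
    filters    = "ensembl_gene_id",
    values     = gene_ids,
    mart       = mart
  )
}

# Example call (needs network access, so it is left commented out):
# fetch_gene_descriptions(c("ENSMUSG00000000001", "ENSMUSG00000000028"))
```

The returned data frame can then be merged with your own table on the Ensembl gene ID.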

ADD COMMENT • link 7.0 years ago James W. MacDonald 66k

Got it! But IMHO the attribute "GENENAME" (and maybe others) is then wrong/confusing/should be removed from the database.

ADD REPLY • link 7.0 years ago Guido Hooiveld ★ 4.0k

score 2 · Accepted Answer · 2017-08-15

Johannes Rainer ★ 2.1k

Italy

Dear Guido,

the naming of the columns in an EnsDb might be misleading, but it reflects the naming convention used by Ensembl. Each gene there has a name, which is stored in the column gene_name (or GENENAME) in the EnsDb. In addition, there is a description field for each gene, but I didn't include it in the EnsDb prior to version 2.1 of these databases. More recent EnsDb databases also have a DESCRIPTION column that you can query. You can get such databases from AnnotationHub:

> library(AnnotationHub)
> ah <- AnnotationHub()
> query(ah, "EnsDb.Mmusculus")
AnnotationHub with 3 records
# snapshotDate(): 2017-07-11
# $dataprovider: Ensembl
# $species: Mus Musculus
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53222"]]'

            title                            
  AH53222 | Ensembl 87 EnsDb for Mus Musculus
  AH53726 | Ensembl 88 EnsDb for Mus Musculus
  AH56691 | Ensembl 89 EnsDb for Mus Musculus

> edb <- ah[["AH53222"]]
loading from cache '/Users/jo//.AnnotationHub/59960'
> columns(edb)
 [1] "DESCRIPTION"         "ENTREZID"            "EXONID"             
 [4] "EXONIDX"             "EXONSEQEND"          "EXONSEQSTART"       
 [7] "GENEBIOTYPE"         "GENEID"              "GENENAME"           
[10] "GENESEQEND"          "GENESEQSTART"        "INTERPROACCESSION"  
[13] "ISCIRCULAR"          "PROTDOMEND"          "PROTDOMSTART"       
[16] "PROTEINDOMAINID"     "PROTEINDOMAINSOURCE" "PROTEINID"          
[19] "PROTEINSEQUENCE"     "SEQCOORDSYSTEM"      "SEQLENGTH"          
[22] "SEQNAME"             "SEQSTRAND"           "SYMBOL"             
[25] "TXBIOTYPE"           "TXCDSSEQEND"         "TXCDSSEQSTART"      
[28] "TXID"                "TXNAME"              "TXSEQEND"           
[31] "TXSEQSTART"          "TXSUPPORTLEVEL"      "UNIPROTDB"          
[34] "UNIPROTID"           "UNIPROTMAPPINGTYPE"
> genes(edb, filter = ~ gene_id == "ENSMUSG00000000001", columns = c("description", "gene_name"), return.type = "data.frame")
                                                                                           description
1 guanine nucleotide binding protein (G protein), alpha inhibiting 3 [Source:MGI Symbol;Acc:MGI:95773]
  gene_name            gene_id
1     Gnai3 ENSMUSG00000000001

In general, I recommend fetching EnsDb databases from AnnotationHub. Through AnnotationHub I provide EnsDbs for all species from Ensembl, and I'll also add EnsDb databases for each new Ensembl release.

cheers, jo

ADD COMMENT • link 7.0 years ago Johannes Rainer ★ 2.1k

Thanks, Jo, for your explanation and code! I will follow your recommendation to use AnnotationHub. However, a quick try did not work: I am using R-3.4.0, but apparently R-3.4.1 is required... usually this isn't such a strict requirement. Anyway, I hope to upgrade soon. Or do I misunderstand the error?

> library(AnnotationHub)
> ah = AnnotationHub()
> query(ah, "EnsDb.Mmusculus")
AnnotationHub with 2 records
# snapshotDate(): 2017-04-25
# $dataprovider: Ensembl
# $species: Mus Musculus
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53222"]]'

            title                           
  AH53222 | Ensembl 87 EnsDb for Mus Musculus
  AH53726 | Ensembl 88 EnsDb for Mus Musculus

> edb <- ah[["AH53222"]]
require(“ensembldb”)
Error: failed to load resource
  name: AH53222
  title: Ensembl 87 EnsDb for Mus Musculus
  reason: require(“ensembldb”) failed: package ‘ensembldb’ was built under R version 3.4.1
>

> sessionInfo()
R version 3.4.0 Patched (2017-05-10 r72670)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252  
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                         
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods 
[8] base     

other attached packages:
[1] AnnotationHub_2.8.2 BiocGenerics_0.22.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.12                  AnnotationDbi_1.38.2        
[3] IRanges_2.10.2                bit_1.1-12                  
[5] xtable_1.8-2                  R6_2.2.2                    
[7] rlang_0.1.2                   blob_1.1.0                  
[9] httr_1.2.1                    tools_3.4.0                 
[11] Biobase_2.36.2                DBI_0.7                     
[13] htmltools_0.3.6               yaml_2.1.14                 
[15] bit64_0.9-7                   digest_0.6.12               
[17] tibble_1.3.3                  interactiveDisplayBase_1.14.0
[19] shiny_1.0.3                   S4Vectors_0.14.3            
[21] curl_2.8.1                    memoise_1.1.0               
[23] RSQLite_2.0                   mime_0.5                    
[25] compiler_3.4.0                BiocInstaller_1.26.0        
[27] stats4_3.4.0                  httpuv_1.3.5                
[29] pkgconfig_2.0.1

ADD REPLY • link 7.0 years ago Guido Hooiveld ★ 4.0k

Problem solved! It turned out that the ensembldb package was not loaded. After explicitly loading it, everything worked as expected!

> library(AnnotationHub)
> library(ensembldb)
> ah = AnnotationHub()
snapshotDate(): 2017-04-25
> query(ah, "EnsDb.Mmusculus")
AnnotationHub with 2 records
# snapshotDate(): 2017-04-25
# $dataprovider: Ensembl
# $species: Mus Musculus
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53222"]]'

            title                            
  AH53222 | Ensembl 87 EnsDb for Mus Musculus
  AH53726 | Ensembl 88 EnsDb for Mus Musculus
> edb <- ah[["AH53222"]]
> genes(edb, filter = ~ gene_id == "ENSMUSG00000000001", columns = c("description", "gene_name"), return.type = "data.frame")
                                                                                           description
1 guanine nucleotide binding protein (G protein), alpha inhibiting 3 [Source:MGI Symbol;Acc:MGI:95773]
  gene_name            gene_id
1     Gnai3 ENSMUSG00000000001
>

ADD REPLY • link 7.0 years ago Guido Hooiveld ★ 4.0k

Dear Johannes, small follow-up question/feature request:

I noticed that when using the select() interface with the EnsDb, gene IDs that are no longer current (that is, IDs that are not present in the EnsDb) are not returned at all, so in such cases the output is shorter than the input list. Although I can work around this, it would be nice if the output still contained these unmapped genes (with <NA>s), analogous to the output of select() used on e.g. org.Hs.eg.db. See the example below.

FYI: I noticed this when analyzing a mouse RNA-seq dataset downloaded from GEO; neither the GEO record nor the paper states which version of Ensembl was used for mapping. I therefore just used a recent EnsDb version available through AnnotationHub.

> library(AnnotationHub)
> library(ensembldb)
> ah = AnnotationHub()
snapshotDate(): 2017-04-25
> edb <- ah[["AH53222"]] #  AH53222 | Ensembl 87 EnsDb for Mus Musculus
>
> geneIDs <- c("ENSMUSG00000000001", "ENSMUSG00000000028", "ENSMUSG00000001379") #3 genes
> select(edb, keys=geneIDs, columns= c("GENEID","DESCRIPTION", "GENENAME"),keytype="GENEID")
              GENEID
1 ENSMUSG00000000001
2 ENSMUSG00000000028
                                                                                           DESCRIPTION
1 guanine nucleotide binding protein (G protein), alpha inhibiting 3 [Source:MGI Symbol;Acc:MGI:95773]
2                                           cell division cycle 45 [Source:MGI Symbol;Acc:MGI:1338073]
  GENENAME
1    Gnai3
2    Cdc45
> 
# please note that the output contains only 2 entries, because no annotation info could be obtained for the last ID (ENSMUSG00000001379, which has been retired).

> library(org.Hs.eg.db)

> probeids <-c("1", "2", "14679") #again 3 genes
> select(org.Hs.eg.db, keys=probeids, columns=c("ENTREZID","SYMBOL","GENENAME"),keytype="ENTREZID")
'select()' returned 1:1 mapping between keys and columns
  ENTREZID SYMBOL               GENENAME
1        1   A1BG alpha-1-B glycoprotein
2        2    A2M  alpha-2-macroglobulin
3    14679   <NA>                   <NA>
>
# please note that the output still contains 3 entries, although no annotation info could be obtained for the last ID (which actually is a mouse gene).
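
For reference, one possible workaround in base R, sketched under some assumptions: pad_select_result() is a hypothetical helper name, and the data frame below is a hand-made mock of the two-row select() output above (descriptions omitted), so no database is needed to try it. It keeps only the first row per key, so it suits 1:1 mappings; for 1:many mappings, merge() with all.x = TRUE may be a better fit.

```r
# Hypothetical helper: re-expand a select() result so that every input key
# gets exactly one row, with NA for keys that were not found.
pad_select_result <- function(keys, result, keycol = "GENEID") {
  # match() gives NA for keys absent from the result; indexing a data frame
  # with an NA row index yields a row filled with NA.
  padded <- result[match(keys, result[[keycol]]), , drop = FALSE]
  padded[[keycol]] <- keys  # restore the key in the otherwise all-NA rows
  rownames(padded) <- NULL
  padded
}

# Mock of the two-row select() result shown above.
res <- data.frame(
  GENEID   = c("ENSMUSG00000000001", "ENSMUSG00000000028"),
  GENENAME = c("Gnai3", "Cdc45"),
  stringsAsFactors = FALSE
)
gene_ids <- c("ENSMUSG00000000001", "ENSMUSG00000000028", "ENSMUSG00000001379")

# Three rows, in input order; GENENAME is NA for the retired ID.
pad_select_result(gene_ids, res)
```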

ADD REPLY • link 6.9 years ago Guido Hooiveld ★ 4.0k

Dear Guido, thanks for your suggestion. I've opened an issue at the GitHub repo https://github.com/jotsetung/ensembldb/issues/53 and will think about a possible solution.

ADD REPLY • link 6.9 years ago Johannes Rainer ★ 2.1k

AnnotationHub shouldn't fail on this warning, and this has been fixed in version 2.9.9, available in 'devel' in a day or so and in the next Bioconductor (3.6) release in October, 2017.

ADD REPLY • link 6.9 years ago Martin Morgan 25k