I have an Ensembl-based RNA-seq dataset that I would like to annotate. For that I use the package EnsDb.Mmusculus.v79. However, I am somehow not able to retrieve the gene name (description). Based on the retrievable annotation info, I assumed this would be accessible through the argument columns = "GENENAME", but this turns out not to be the case (that returns the same as "SYMBOL"). See the example code below. Am I missing something obvious? Any advice on how to do this would be appreciated.
> library(EnsDb.Mmusculus.v79)
>
> keys <- keys(EnsDb.Mmusculus.v79)[1:15]
> anno.result <- select(EnsDb.Mmusculus.v79, keys=keys, columns=c("GENEID","SYMBOL","GENENAME","ENTREZID"),keytype="GENEID")
>
> anno.result
               GENEID SYMBOL GENENAME ENTREZID
1  ENSMUSG00000000001  Gnai3    Gnai3    14679
2  ENSMUSG00000000003   Pbsn     Pbsn    54192
3  ENSMUSG00000000028  Cdc45    Cdc45    12544
4  ENSMUSG00000000031    H19      H19
5  ENSMUSG00000000037  Scml2    Scml2   107815
6  ENSMUSG00000000049   Apoh     Apoh    11818
7  ENSMUSG00000000056   Narf     Narf    67608
8  ENSMUSG00000000058   Cav2     Cav2    12390
9  ENSMUSG00000000078   Klf6     Klf6    23849
10 ENSMUSG00000000085  Scmh1    Scmh1
11 ENSMUSG00000000088  Cox5a    Cox5a    12858
12 ENSMUSG00000000093   Tbx2     Tbx2    21385
13 ENSMUSG00000000094   Tbx4     Tbx4    21387
14 ENSMUSG00000000103   Zfy2     Zfy2
15 ENSMUSG00000000120   Ngfr     Ngfr    18053
>
For the first gene, Gnai3 (ENSMUSG00000000001), the description should be "guanine nucleotide binding protein (G protein), alpha inhibiting 3" according to the Ensembl website.
> columns(EnsDb.Mmusculus.v79)
 [1] "ENTREZID"            "EXONID"              "EXONIDX"
 [4] "EXONSEQEND"          "EXONSEQSTART"        "GENEBIOTYPE"
 [7] "GENEID"              "GENENAME"            "GENESEQEND"
[10] "GENESEQSTART"        "INTERPROACCESSION"   "ISCIRCULAR"
[13] "PROTDOMEND"          "PROTDOMSTART"        "PROTEINDOMAINID"
[16] "PROTEINDOMAINSOURCE" "PROTEINID"           "PROTEINSEQUENCE"
[19] "SEQCOORDSYSTEM"      "SEQLENGTH"           "SEQNAME"
[22] "SEQSTRAND"           "SYMBOL"              "TXBIOTYPE"
[25] "TXCDSSEQEND"         "TXCDSSEQSTART"       "TXID"
[28] "TXNAME"              "TXSEQEND"            "TXSEQSTART"
[31] "UNIPROTDB"           "UNIPROTID"           "UNIPROTMAPPINGTYPE"
>
> sessionInfo()
R version 3.4.0 Patched (2017-05-10 r72670)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] EnsDb.Mmusculus.v79_2.1.0 ensembldb_2.0.4
 [3] AnnotationFilter_1.0.0    GenomicFeatures_1.28.4
 [5] AnnotationDbi_1.38.2      Biobase_2.36.2
 [7] GenomicRanges_1.28.4      GenomeInfoDb_1.12.2
 [9] IRanges_2.10.2            S4Vectors_0.14.3
[11] BiocGenerics_0.22.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12               BiocInstaller_1.26.0
 [3] AnnotationHub_2.8.2        compiler_3.4.0
 [5] XVector_0.16.0             ProtGenerics_1.8.0
 [7] bitops_1.0-6               tools_3.4.0
 [9] zlibbioc_1.22.0            biomaRt_2.32.1
[11] digest_0.6.12              bit_1.1-12
[13] RSQLite_2.0                memoise_1.1.0
[15] tibble_1.3.3               lattice_0.20-35
[17] pkgconfig_2.0.1            rlang_0.1.2
[19] Matrix_1.2-10              shiny_1.0.3
[21] DelayedArray_0.2.7         DBI_0.7
[23] curl_2.8.1                 yaml_2.1.14
[25] GenomeInfoDbData_0.99.0    httr_1.2.1
[27] rtracklayer_1.36.4         Biostrings_2.44.2
[29] bit64_0.9-7                grid_3.4.0
[31] R6_2.2.2                   XML_3.98-1.9
[33] BiocParallel_1.10.1        blob_1.1.0
[35] htmltools_0.3.6            Rsamtools_1.28.0
[37] matrixStats_0.52.2         GenomicAlignments_1.12.1
[39] SummarizedExperiment_1.6.3 xtable_1.8-2
[41] mime_0.5                   interactiveDisplayBase_1.14.0
[43] httpuv_1.3.5               RCurl_1.95-4.8
[45] lazyeval_0.2.0
>

Thanks Jo for your explanation and code! I will follow your recommendation to use AnnotationHub. However, a quick try revealed that I didn't get it to work; I am using R-3.4.0, but apparently R-3.4.1 is required... usually this isn't such a strict problem. Anyway, I hope to do this soon. Or do I misunderstand the error?
> library(AnnotationHub)
> ah = AnnotationHub()
> query(ah, "EnsDb.Mmusculus")
AnnotationHub with 2 records
# snapshotDate(): 2017-04-25
# $dataprovider: Ensembl
# $species: Mus Musculus
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53222"]]'

            title
  AH53222 | Ensembl 87 EnsDb for Mus Musculus
  AH53726 | Ensembl 88 EnsDb for Mus Musculus
> edb <- ah[["AH53222"]]
require(“ensembldb”)
Error: failed to load resource
  name: AH53222
  title: Ensembl 87 EnsDb for Mus Musculus
  reason: require(“ensembldb”) failed: package ‘ensembldb’ was built under R version 3.4.1
>

Problem solved! It turned out the library ensembldb was not loaded... After explicitly loading it, everything worked as expected!
> library(AnnotationHub)
> library(ensembldb)
> ah = AnnotationHub()
snapshotDate(): 2017-04-25
> query(ah, "EnsDb.Mmusculus")
AnnotationHub with 2 records
# snapshotDate(): 2017-04-25
# $dataprovider: Ensembl
# $species: Mus Musculus
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53222"]]'

            title
  AH53222 | Ensembl 87 EnsDb for Mus Musculus
  AH53726 | Ensembl 88 EnsDb for Mus Musculus
> edb <- ah[["AH53222"]]
> genes(edb, filter = ~ gene_id == "ENSMUSG00000000001", columns = c("description", "gene_name"), return.type = "data.frame")
                                                                                           description
1 guanine nucleotide binding protein (G protein), alpha inhibiting 3 [Source:MGI Symbol;Acc:MGI:95773]
  gene_name            gene_id
1     Gnai3 ENSMUSG00000000001
>

Dear Johannes, small follow-up question/feature request:
I noticed that when using the select() interface with the EnsDb, gene IDs that are no longer current (that is, IDs that are not present in the EnsDb) are not returned at all, so in such cases the output is shorter than the input list. Although I can work my way around this, it would be nice if the output still contained these unmapped genes (with <NA>s), analogous to the output of select() used on e.g. org.Hs.eg.db. See below for an example.
FYI: I noticed this when analyzing a mouse RNA-seq dataset downloaded from GEO, but neither there nor in the paper is it stated which version of Ensembl was used for mapping. I therefore just used a recent EnsDb version available through AnnotationHub.
> library(AnnotationHub)
> library(ensembldb)
> ah = AnnotationHub()
snapshotDate(): 2017-04-25
> edb <- ah[["AH53222"]] # AH53222 | Ensembl 87 EnsDb for Mus Musculus
>
> geneIDs <- c("ENSMUSG00000000001", "ENSMUSG00000000028", "ENSMUSG00000001379") #3 genes
> select(edb, keys=geneIDs, columns= c("GENEID","DESCRIPTION", "GENENAME"),keytype="GENEID")
              GENEID
1 ENSMUSG00000000001
2 ENSMUSG00000000028
                                                                                           DESCRIPTION
1 guanine nucleotide binding protein (G protein), alpha inhibiting 3 [Source:MGI Symbol;Acc:MGI:95773]
2                                           cell division cycle 45 [Source:MGI Symbol;Acc:MGI:1338073]
  GENENAME
1    Gnai3
2    Cdc45
> # please note that the output contains only 2 entries, because no annotation info could be obtained for the last ID (ENSMUSG00000001379, which has been retired).
> library(org.Hs.eg.db)
> probeids <-c("1", "2", "14679") #again 3 genes
> select(org.Hs.eg.db, keys=probeids, columns=c("ENTREZID","SYMBOL","GENENAME"),keytype="ENTREZID")
'select()' returned 1:1 mapping between keys and columns
  ENTREZID SYMBOL               GENENAME
1        1   A1BG alpha-1-B glycoprotein
2        2    A2M  alpha-2-macroglobulin
3    14679   <NA>                   <NA>
> # please note that the output still contains 3 entries, although no annotation info could be obtained for the last ID (which actually is a mouse gene).

Dear Guido, thanks for your suggestion. I've opened an issue over at the github repo https://github.com/jotsetung/ensembldb/issues/53 and will think of a possible solution.
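In the meantime, the workaround for the dropped IDs discussed above can be done in plain R by re-indexing the select() result against the original key vector. The following is only a minimal sketch: pad_missing_keys is a hypothetical helper name (not part of ensembldb or AnnotationDbi), the small data frame stands in for the select() output shown above, and a 1:1 mapping between keys and rows is assumed.

```r
# Hypothetical helper (not an ensembldb function): return one row per
# requested key, with NA in the annotation columns for keys that the
# annotation table does not contain.
pad_missing_keys <- function(anno, keys, keycol = "GENEID") {
  # match() gives NA for keys absent from the table; indexing rows with NA
  # produces all-NA rows, so input order and length are preserved.
  # Assumption: 1:1 mapping, only the first match per key is kept.
  padded <- anno[match(keys, anno[[keycol]]), , drop = FALSE]
  padded[[keycol]] <- keys   # restore the ID for the unmapped rows
  rownames(padded) <- NULL
  padded
}

# Stand-in for the select() result above (only 2 of the 3 keys returned).
geneIDs <- c("ENSMUSG00000000001", "ENSMUSG00000000028", "ENSMUSG00000001379")
anno <- data.frame(
  GENEID   = c("ENSMUSG00000000001", "ENSMUSG00000000028"),
  GENENAME = c("Gnai3", "Cdc45"),
  stringsAsFactors = FALSE
)

padded <- pad_missing_keys(anno, geneIDs)
print(padded)
```

With real data, anno would be the data frame returned by select(edb, keys=geneIDs, ...). If a key can map to several rows, this sketch silently keeps only the first one, so a merge() with all.x=TRUE on a data frame of keys may be the safer choice in that case.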
AnnotationHub shouldn't fail on this warning, and this has been fixed in version 2.9.9, available in 'devel' in a day or so and in the next Bioconductor (3.6) release in October, 2017.