I have an ENSEMBL-based RNA-seq dataset that I would like to annotate. For that I use the library EnsDb.Mmusculus.v79. However, I am somehow not able to retrieve the gene name (description). Based on the retrievable annotation info, I assumed this would be accessible through the argument columns = "GENENAME", but this turns out not to be the case (it rather returns the same as "SYMBOL"). See the example code below. Am I missing something obvious? Any advice on how to do this would be appreciated.

    > library(EnsDb.Mmusculus.v79)
    >
    > keys <- keys(EnsDb.Mmusculus.v79)[1:15]
    > anno.result <- select(EnsDb.Mmusculus.v79, keys=keys, columns=c("GENEID","SYMBOL","GENENAME","ENTREZID"),keytype="GENEID")
    >
    > anno.result
                   GENEID SYMBOL GENENAME ENTREZID
    1  ENSMUSG00000000001  Gnai3    Gnai3    14679
    2  ENSMUSG00000000003   Pbsn     Pbsn    54192
    3  ENSMUSG00000000028  Cdc45    Cdc45    12544
    4  ENSMUSG00000000031    H19      H19
    5  ENSMUSG00000000037  Scml2    Scml2   107815
    6  ENSMUSG00000000049   Apoh     Apoh    11818
    7  ENSMUSG00000000056   Narf     Narf    67608
    8  ENSMUSG00000000058   Cav2     Cav2    12390
    9  ENSMUSG00000000078   Klf6     Klf6    23849
    10 ENSMUSG00000000085  Scmh1    Scmh1
    11 ENSMUSG00000000088  Cox5a    Cox5a    12858
    12 ENSMUSG00000000093   Tbx2     Tbx2    21385
    13 ENSMUSG00000000094   Tbx4     Tbx4    21387
    14 ENSMUSG00000000103   Zfy2     Zfy2
    15 ENSMUSG00000000120   Ngfr     Ngfr    18053
    >

For the first gene, Gnai3 (ENSMUSG00000000001), the description should be "guanine nucleotide binding protein (G protein), alpha inhibiting 3" (according to the ENSEMBL website).

    > columns(EnsDb.Mmusculus.v79)
     [1] "ENTREZID"            "EXONID"              "EXONIDX"
     [4] "EXONSEQEND"          "EXONSEQSTART"        "GENEBIOTYPE"
     [7] "GENEID"              "GENENAME"            "GENESEQEND"
    [10] "GENESEQSTART"        "INTERPROACCESSION"   "ISCIRCULAR"
    [13] "PROTDOMEND"          "PROTDOMSTART"        "PROTEINDOMAINID"
    [16] "PROTEINDOMAINSOURCE" "PROTEINID"           "PROTEINSEQUENCE"
    [19] "SEQCOORDSYSTEM"      "SEQLENGTH"           "SEQNAME"
    [22] "SEQSTRAND"           "SYMBOL"              "TXBIOTYPE"
    [25] "TXCDSSEQEND"         "TXCDSSEQSTART"       "TXID"
    [28] "TXNAME"              "TXSEQEND"            "TXSEQSTART"
    [31] "UNIPROTDB"           "UNIPROTID"           "UNIPROTMAPPINGTYPE"
    >

    > sessionInfo()
    R version 3.4.0 Patched (2017-05-10 r72670)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1

    Matrix products: default

    locale:
    [1] LC_COLLATE=English_United States.1252
    [2] LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252
    [4] LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] stats4    parallel  stats     graphics  grDevices utils     datasets
    [8] methods   base

    other attached packages:
     [1] EnsDb.Mmusculus.v79_2.1.0 ensembldb_2.0.4
     [3] AnnotationFilter_1.0.0    GenomicFeatures_1.28.4
     [5] AnnotationDbi_1.38.2      Biobase_2.36.2
     [7] GenomicRanges_1.28.4      GenomeInfoDb_1.12.2
     [9] IRanges_2.10.2            S4Vectors_0.14.3
    [11] BiocGenerics_0.22.0

    loaded via a namespace (and not attached):
     [1] Rcpp_0.12.12               BiocInstaller_1.26.0
     [3] AnnotationHub_2.8.2        compiler_3.4.0
     [5] XVector_0.16.0             ProtGenerics_1.8.0
     [7] bitops_1.0-6               tools_3.4.0
     [9] zlibbioc_1.22.0            biomaRt_2.32.1
    [11] digest_0.6.12              bit_1.1-12
    [13] RSQLite_2.0                memoise_1.1.0
    [15] tibble_1.3.3               lattice_0.20-35
    [17] pkgconfig_2.0.1            rlang_0.1.2
    [19] Matrix_1.2-10              shiny_1.0.3
    [21] DelayedArray_0.2.7         DBI_0.7
    [23] curl_2.8.1                 yaml_2.1.14
    [25] GenomeInfoDbData_0.99.0    httr_1.2.1
    [27] rtracklayer_1.36.4         Biostrings_2.44.2
    [29] bit64_0.9-7                grid_3.4.0
    [31] R6_2.2.2                   XML_3.98-1.9
    [33] BiocParallel_1.10.1        blob_1.1.0
    [35] htmltools_0.3.6            Rsamtools_1.28.0
    [37] matrixStats_0.52.2         GenomicAlignments_1.12.1
    [39] SummarizedExperiment_1.6.3 xtable_1.8-2
    [41] mime_0.5                   interactiveDisplayBase_1.14.0
    [43] httpuv_1.3.5               RCurl_1.95-4.8
    [45] lazyeval_0.2.0
    >

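The columns listed above contain no description field, so the v79 package itself apparently cannot return it. One possible route, sketched here under assumptions, is to fetch a more recent EnsDb through AnnotationHub and ask for a DESCRIPTION column if that database offers one. The helper name fetch_descriptions() is hypothetical, picking the last record returned by the query as "most recent" is an assumption, and so is the presence of a DESCRIPTION column (the code checks for it rather than relying on it). Calling the function needs the Bioconductor packages to be installed and a network connection.

```r
# Sketch only: fetch_descriptions() is a hypothetical helper, not part of any package.
fetch_descriptions <- function(gene_ids, species = "Mus musculus") {
  # The ensembldb namespace must be loaded, otherwise the select()/columns()
  # methods for EnsDb objects are not registered.
  for (pkg in c("AnnotationHub", "ensembldb", "AnnotationDbi")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("package not installed: ", pkg)
    }
  }

  ah <- AnnotationHub::AnnotationHub()
  hits <- AnnotationHub::query(ah, c("EnsDb", species))
  if (length(hits) == 0) {
    stop("no EnsDb record found for ", species)
  }

  # Assumption: the last record is the most recent Ensembl release.
  edb <- ah[[utils::tail(names(hits), 1)]]

  # Request DESCRIPTION only if this database actually provides it.
  wanted <- c("GENEID", "SYMBOL", "DESCRIPTION")
  available <- intersect(wanted, AnnotationDbi::columns(edb))

  AnnotationDbi::select(edb, keys = gene_ids, keytype = "GENEID",
                        columns = available)
}
```

A call such as fetch_descriptions("ENSMUSG00000000001") would then return one row per matching gene, with whichever of the three columns the chosen database provides.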
Thanks Jo for your explanation and code! I will follow your recommendation to use AnnotationHub. However, a quick try revealed that I couldn't get it to work; I am using R-3.4.0, but apparently R-3.4.1 is required... usually this isn't such a strict problem. Anyway, I hope to do this soon. Or do I misunderstand the error?

Problem solved! It turned out the library ensembldb was not loaded... After explicitly loading it, everything worked as expected!

Dear Johannes, a small follow-up question/feature request:
I noticed that when using the select() interface with an EnsDb, gene IDs that are no longer current (that is, IDs that are not present in the EnsDb) are not returned at all, which in such cases results in an output that is shorter than the input list. Although I can work my way around this, it would be nice if the output still contained these unmapped genes (with <NA>'s), analogous to the output of select() used on e.g. org.Hs.eg.db. See below for example.

FYI: I noticed this when analyzing a mouse RNA-seq dataset downloaded from GEO; neither there nor in the paper is it stated which version of Ensembl was used for mapping. I therefore just used a recent EnsDb version available through AnnotationHub.
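Until select() on an EnsDb pads unmapped IDs itself, the result can be padded by hand. Below is a minimal base-R sketch of one such workaround. It uses a toy data frame in place of real select() output, and the ID "ENSMUSG_HYPOTHETICAL" is a made-up placeholder standing in for a retired gene ID.

```r
# IDs that were queried; the second one is a made-up placeholder for a
# retired ID that the database no longer contains.
requested <- c("ENSMUSG00000000001", "ENSMUSG_HYPOTHETICAL", "ENSMUSG00000000028")

# Toy stand-in for what select() returns: the unmapped ID is simply missing.
anno <- data.frame(
  GENEID = c("ENSMUSG00000000001", "ENSMUSG00000000028"),
  SYMBOL = c("Gnai3", "Cdc45"),
  stringsAsFactors = FALSE
)

# match() gives NA for IDs without a hit; indexing rows with NA yields an
# all-NA row, so the result has one row per requested ID, in input order.
padded <- anno[match(requested, anno$GENEID), ]
padded$GENEID <- requested  # put the queried ID back into the padded rows
rownames(padded) <- NULL

print(padded)
```

Note that match() keeps only the first hit per ID. If one gene ID maps to several rows (for example several Entrez IDs), merge(data.frame(GENEID = requested), anno, all.x = TRUE) keeps all of them, at the cost of re-sorting the rows.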
Dear Guido, thanks for your suggestion. I've opened an issue over at the GitHub repo (https://github.com/jotsetung/ensembldb/issues/53) and will think of a possible solution.
AnnotationHub shouldn't fail on this warning, and this has been fixed in version 2.9.9, available in 'devel' in a day or so and in the next Bioconductor (3.6) release in October, 2017.