I wanted to use EnsDb.Hsapiens.v86 and org.Hs.eg.db combined, as both databases do not exactly display the same type of information (for example, only EnsDb.Hsapiens.v86 provides with biological function, or list of exons).
However, as you can see on below code, I was quite disappointed to notice that a great number of SYMBOL genes of org.Hs.eg.db weren't in EnsDb.Hsapiens.v86, and respectively.
> symbol_humans_ensembl <- unique(AnnotationDbi::keys(EnsDb.Hsapiens.v86, keytype = "SYMBOL"))
> length(symbol_humans_ensembl)
[1] 56643
> length(grep(pattern = "^RP11-", symbol_humans_ensembl))
[1] 12045
> head(symbol_humans_ensembl[grep(pattern = "^RP11-", symbol_humans_ensembl)])
[1] "RP11-1000B6.2" "RP11-1000B6.3" "RP11-1000B6.5" "RP11-1000B6.7" "RP11-1000B6.8" "RP11-1003J3.1"
length(setdiff(symbol_humans_ensembl, AnnotationDbi::keys(org.Hs.eg.db, keytype="SYMBOL")))
[1]22155
Wanting to investigate more why there was such a mismatch between both databases, I notice that a quite great number of SYMBOL genes of EnsDb.Hsapiens.v86 seem to be not curated, with for instance, among the 22155 genes of EnsDb.Hsapiens.v86 missing in org.Hs.eg.db, a total of 12045 genes being only prefixes of RP11, itself only considered only as an alias of more official SYMBOL PRPF31. On the contrary, many genes being listed both in NCBI database and in org.Hs.eg.db aren't present in EnsDb.Hsapiens.v86.
I have then two questions:
- why keeping so many gene SYMBOL seeming superfluous, or just composite, assembly versions of the exactly same gene in EnsDb.Hsapiens.v86?
- why, on the contrary, do not describe so many genes, including protein-coding ones in EnsDb.Hsapiens.v86, like for example ACLS or AAVS1 genes?
sessionInfo( )
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /softhpc/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /softhpc/R/4.0.2/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
Adding to James answer,
the RP11 symbols are more gene names for long non-coding transcript. As James pointed out, this information is retrieved directly from the Ensembl core databases and it's also the information which is displayed for these genes in the Ensembl genome browser. In addition, you are using an
EnsDb
from Ensembl release 86, which is indeed a very old release. The most recent Ensembl release is 102 and symbols/gene names for many will have changed.Note that you can get up-to-date
EnsDb
databases fromAnnotationHub
:as you see we have databases for each release from 87 to the most recent one and I would suggest that if you use annotation resources from different providers (NCBI, Ensembl, ...) you should at least try to use versions that are from ~ the same time/release.