I wanted to use EnsDb.Hsapiens.v86 and org.Hs.eg.db together, since the two databases do not provide exactly the same kind of information (for example, only EnsDb.Hsapiens.v86 provides the biological function or the list of exons).
However, as the code below shows, I was quite disappointed to find that a large number of gene SYMBOLs in org.Hs.eg.db are not in EnsDb.Hsapiens.v86, and vice versa.
> symbol_humans_ensembl <- unique(AnnotationDbi::keys(EnsDb.Hsapiens.v86, keytype = "SYMBOL"))
> length(symbol_humans_ensembl)
56643
> length(grep(pattern = "^RP11-", symbol_humans_ensembl))
12045
> head(symbol_humans_ensembl[grep(pattern = "^RP11-", symbol_humans_ensembl)])
"RP11-1000B6.2" "RP11-1000B6.3" "RP11-1000B6.5" "RP11-1000B6.7" "RP11-1000B6.8" "RP11-1003J3.1"
> length(setdiff(symbol_humans_ensembl, AnnotationDbi::keys(org.Hs.eg.db, keytype = "SYMBOL")))
22155
Trying to understand why the two databases disagree so much, I noticed that a large number of the gene SYMBOLs in EnsDb.Hsapiens.v86 do not seem to be curated. For instance, of the 22155 EnsDb.Hsapiens.v86 symbols missing from org.Hs.eg.db, 12045 merely start with the prefix RP11, and RP11 itself is listed only as an alias of the official SYMBOL PRPF31. Conversely, many genes listed both in the NCBI database and in org.Hs.eg.db are not present in EnsDb.Hsapiens.v86.
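The code above only counts one direction (symbols in EnsDb.Hsapiens.v86 that are missing from org.Hs.eg.db). A minimal sketch of how both directions could be counted in one go is given here. Note that `compare_symbols` is a hypothetical helper written for this post, not a function from either package, and the toy vectors are made up for illustration only. The last part runs only if both annotation packages are installed.

```r
# Hypothetical helper (not part of any package): summarise how two symbol
# vectors overlap, and how many symbols unique to `a` match a given prefix.
compare_symbols <- function(a, b, prefix = "^RP11-") {
  a <- unique(a)
  b <- unique(b)
  only_a <- setdiff(a, b)   # present in a, missing from b
  only_b <- setdiff(b, a)   # present in b, missing from a
  list(
    shared          = length(intersect(a, b)),
    only_a          = length(only_a),
    only_b          = length(only_b),
    only_a_prefixed = sum(grepl(prefix, only_a))
  )
}

# Toy vectors, invented for illustration (not real query results)
toy_ensdb <- c("TP53", "RP11-1000B6.2", "RP11-1000B6.3", "BRCA1")
toy_orgdb <- c("TP53", "BRCA1", "AAVS1")
toy_result <- compare_symbols(toy_ensdb, toy_orgdb)

# Same comparison on the real databases, assuming both packages are installed
if (requireNamespace("EnsDb.Hsapiens.v86", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  ens <- AnnotationDbi::keys(EnsDb.Hsapiens.v86::EnsDb.Hsapiens.v86,
                             keytype = "SYMBOL")
  org <- AnnotationDbi::keys(org.Hs.eg.db::org.Hs.eg.db, keytype = "SYMBOL")
  str(compare_symbols(ens, org))
}
```

Here `only_a` corresponds to the 22155 figure reported above, and `only_b` would give the count for the opposite direction, which I have not reported here.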
I therefore have two questions:
- Why does EnsDb.Hsapiens.v86 keep so many gene SYMBOLs that seem superfluous, or that look like composite or assembly-specific versions of the very same gene?
- Conversely, why does EnsDb.Hsapiens.v86 not describe so many genes, including protein-coding ones, such as ACLS or AAVS1?
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /softhpc/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /softhpc/R/4.0.2/lib64/R/lib/libRlapack.so

locale:
 LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
 LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
 LC_PAPER=en_US.UTF-8 LC_NAME=C
 LC_ADDRESS=C LC_TELEPHONE=C
 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 stats4 parallel stats graphics grDevices utils datasets
 methods base