ensembldb for Homo Sapiens organism containing superfluous SYMBOL genes?
@5de73a99

I wanted to use EnsDb.Hsapiens.v86 and org.Hs.eg.db together, since the two databases do not provide exactly the same information (for example, only EnsDb.Hsapiens.v86 provides biological function, or the list of exons).

However, as you can see in the code below, I was quite disappointed to find that a great number of the gene SYMBOLs in EnsDb.Hsapiens.v86 are not in org.Hs.eg.db, and vice versa.


> symbol_humans_ensembl <- unique(AnnotationDbi::keys(EnsDb.Hsapiens.v86, keytype = "SYMBOL"))
> length(symbol_humans_ensembl)
[1] 56643
> length(grep(pattern = "^RP11-", symbol_humans_ensembl))
[1] 12045
> head(grep(pattern = "^RP11-", symbol_humans_ensembl, value = TRUE))
[1] "RP11-1000B6.2" "RP11-1000B6.3" "RP11-1000B6.5" "RP11-1000B6.7" "RP11-1000B6.8" "RP11-1003J3.1"
> length(setdiff(symbol_humans_ensembl, AnnotationDbi::keys(org.Hs.eg.db, keytype = "SYMBOL")))
[1] 22155


Looking further into why the two databases mismatch, I noticed that a large number of the gene SYMBOLs in EnsDb.Hsapiens.v86 do not seem to be curated. For instance, 22155 of its symbols are missing from org.Hs.eg.db, and 12045 of its symbols merely start with the prefix RP11-, which is itself listed only as an alias of the official SYMBOL PRPF31. Conversely, many genes listed both in the NCBI database and in org.Hs.eg.db are not present in EnsDb.Hsapiens.v86.

I therefore have two questions:

• Why does EnsDb.Hsapiens.v86 keep so many gene SYMBOLs that seem superfluous, or that are just composite, assembly-version names of exactly the same gene?
• Why, conversely, are so many genes, including protein-coding ones such as ACLS or AAVS1, not described in EnsDb.Hsapiens.v86?

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /softhpc/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /softhpc/R/4.0.2/lib64/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

@james-w-macdonald-5106

You are taking gene location data from EMBL-EBI and gene annotation data from NCBI and wondering why they don't agree. It's because they are based on two different ways of inferring what is and isn't a gene, and where it might be. And the gene symbols come from HUGO, so you are adding a third player into the mix, with predictable results.

I'll make three points.

1. None of the data provided in any of the annotation packages you have been asking about is modified in any way by either Bioconductor core, or in the case of the EnsDb objects, Johannes Rainer, who has done an incredible amount of work to provide them. These packages are simply convenient repackagings of existing data, and any question about why the data are the way they are is more or less off-topic for this site.
2. The EnsDb packages are based on Ensembl data (hence the name) and the OrgDb packages are based on NCBI data (hence the 'eg' in the name). There are any number of disagreements between these two annotation services as to what is a gene, where it might be, what transcripts are known to arise from the gene, etc. If you try to compare things between those datasets you will inevitably run into disagreements, which is entirely predictable, has nothing to do with the packages (see item 1), and isn't answerable on this forum.
3. There seems to be this idea amongst some people that the human genome is some static thing that we know everything about, and that there therefore shouldn't be these discrepancies. Nothing could be farther from the truth. We are in the very beginning stages of our exploration of the genome, and the more we explore, the more confusing it all becomes.
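
To get a feel for how far the two providers diverge without going through symbols at all, one can count how many NCBI (Entrez) genes in org.Hs.eg.db carry an Ensembl gene ID. This is only a rough sketch, assuming org.Hs.eg.db is installed; the variable names are made up for illustration, and taking the first Ensembl ID per gene is a simplification.

```r
library(org.Hs.eg.db)  # attaches AnnotationDbi as well

# All Entrez gene IDs known to the OrgDb package
entrez_ids <- keys(org.Hs.eg.db, keytype = "ENTREZID")

# Map each Entrez ID to an Ensembl gene ID; NA where NCBI records no mapping.
# multiVals = "first" keeps only one Ensembl ID per gene (a simplification).
entrez_to_ensembl <- mapIds(org.Hs.eg.db,
                            keys = entrez_ids,
                            column = "ENSEMBL",
                            keytype = "ENTREZID",
                            multiVals = "first")

# Entrez genes with and without an Ensembl counterpart
table(has_ensembl_id = !is.na(entrez_to_ensembl))
```

Whatever the counts turn out to be, they reflect the source databases and not the packages.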

The RP11 symbols are rather gene names for long non-coding transcripts. As James pointed out, this information is retrieved directly from the Ensembl core databases, and it is also the information that is displayed for these genes in the Ensembl genome browser. In addition, you are using an EnsDb from Ensembl release 86, which is indeed a very old release. The most recent Ensembl release is 102, and the symbols/gene names of many genes will have changed.
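
If you want to check this yourself, one option is to tabulate the biotypes that Ensembl assigns to the RP11- genes. This is just a sketch, assuming EnsDb.Hsapiens.v86 is installed and that selecting on the gene name prefix is a good enough way to pick these genes out; `rp11` is a name chosen here for illustration.

```r
library(ensembldb)
library(EnsDb.Hsapiens.v86)

edb <- EnsDb.Hsapiens.v86

# Fetch all genes whose name starts with "RP11-", together with their biotype
rp11 <- genes(edb,
              filter = GeneNameFilter("RP11-", condition = "startsWith"),
              columns = c("gene_id", "gene_name", "gene_biotype"),
              return.type = "data.frame")

# Number of RP11- genes per Ensembl biotype, most frequent first
sort(table(rp11$gene_biotype), decreasing = TRUE)
```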

Note that you can get up-to-date EnsDb databases from AnnotationHub:

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2020-10-27
> query(ah, "EnsDb.Hsapiens")
AnnotationHub with 17 records
# snapshotDate(): 2020-10-27
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53211"]]'

            title
  AH53211 | Ensembl 87 EnsDb for Homo Sapiens
  AH53715 | Ensembl 88 EnsDb for Homo Sapiens
  ...       ...
  AH83216 | Ensembl 101 EnsDb for Homo sapiens
  AH89180 | Ensembl 102 EnsDb for Homo sapiens


As you can see, we have databases for each release from 87 to the most recent one. If you use annotation resources from different providers (NCBI, Ensembl, ...), I would suggest that you at least try to use versions that are from roughly the same time/release.
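
As a rough sketch of how that could look, the code below fetches the Ensembl 102 record from the listing above (AH89180) and repeats the symbol comparison from the question against it. It assumes an internet connection, an AnnotationHub snapshot that still contains this record, and an installed org.Hs.eg.db. The variable names are only illustrative, and I make no claim about what numbers you will get.

```r
library(AnnotationHub)
library(ensembldb)
library(org.Hs.eg.db)

ah <- AnnotationHub()

# Download (or load from the local cache) the Ensembl 102 EnsDb for human
edb_102 <- ah[["AH89180"]]

# Same comparison as in the question, but against the newer release
symbols_ensembl <- unique(keys(edb_102, keytype = "SYMBOL"))
symbols_ncbi <- keys(org.Hs.eg.db, keytype = "SYMBOL")

length(symbols_ensembl)                         # symbols in Ensembl 102
length(setdiff(symbols_ensembl, symbols_ncbi))  # only in Ensembl
length(setdiff(symbols_ncbi, symbols_ensembl))  # only in org.Hs.eg.db
```

Some differences will remain however closely the releases are matched, for the reasons James gave.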