Question

Does the ensembldb annotation for Homo sapiens contain superfluous gene SYMBOLs?

bastien_chassagnol • 0

I wanted to use EnsDb.Hsapiens.v86 and org.Hs.eg.db together, since the two databases do not provide exactly the same kind of information (for example, only EnsDb.Hsapiens.v86 provides the biological function or the list of exons).

However, as the code below shows, I was disappointed to find that a large number of the gene SYMBOLs in org.Hs.eg.db are not in EnsDb.Hsapiens.v86, and vice versa.


> library(EnsDb.Hsapiens.v86)
> library(org.Hs.eg.db)
> symbol_humans_ensembl <- unique(AnnotationDbi::keys(EnsDb.Hsapiens.v86, keytype = "SYMBOL"))
> length(symbol_humans_ensembl)
[1] 56643
> length(grep(pattern = "^RP11-", symbol_humans_ensembl))
[1] 12045
> head(symbol_humans_ensembl[grep(pattern = "^RP11-", symbol_humans_ensembl)])
[1] "RP11-1000B6.2" "RP11-1000B6.3" "RP11-1000B6.5" "RP11-1000B6.7" "RP11-1000B6.8" "RP11-1003J3.1"
> length(setdiff(symbol_humans_ensembl, AnnotationDbi::keys(org.Hs.eg.db, keytype = "SYMBOL")))
[1] 22155

Looking into why the two databases disagree so much, I noticed that a large number of the gene SYMBOLs in EnsDb.Hsapiens.v86 do not seem to be curated. For instance, of the 22155 symbols in EnsDb.Hsapiens.v86 that are missing from org.Hs.eg.db, 12045 start with the prefix RP11, which is itself listed only as an alias of the official SYMBOL PRPF31. Conversely, many genes listed both in the NCBI database and in org.Hs.eg.db are not present in EnsDb.Hsapiens.v86.
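One way to look further into what the unmatched symbols are is to break them down by gene biotype. The following is a minimal sketch, not a definitive analysis: the helper name `unmatched_by_biotype` is hypothetical, and it assumes a gene table with the `symbol` and `gene_biotype` columns that `ensembldb::genes()` returns. The part that queries the real databases only runs if both annotation packages are installed.

```r
# Hypothetical helper: cross-tabulate gene biotype against whether the
# gene's symbol is found in a reference set of symbols.
unmatched_by_biotype <- function(gene_table, reference_symbols) {
    status <- factor(
        gene_table$symbol %in% reference_symbols,
        levels = c(FALSE, TRUE),
        labels = c("unmatched", "matched")
    )
    counts <- table(biotype = gene_table$gene_biotype, status = status)
    # Put the biotypes with the most unmatched symbols first.
    counts[order(counts[, "unmatched"], decreasing = TRUE), , drop = FALSE]
}

# Only query the real databases when both packages are available.
if (requireNamespace("EnsDb.Hsapiens.v86", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
    suppressPackageStartupMessages({
        library(EnsDb.Hsapiens.v86)
        library(org.Hs.eg.db)
    })
    # One row per Ensembl gene, with its symbol and biotype.
    ens_genes <- ensembldb::genes(
        EnsDb.Hsapiens.v86,
        columns = c("gene_id", "symbol", "gene_biotype"),
        return.type = "data.frame"
    )
    org_symbols <- AnnotationDbi::keys(org.Hs.eg.db, keytype = "SYMBOL")
    counts <- unmatched_by_biotype(ens_genes, org_symbols)
    # Show the ten biotypes contributing the most unmatched symbols.
    print(counts[seq_len(min(10, nrow(counts))), , drop = FALSE])
}
```

If most of the unmatched symbols fall into non-coding or pseudogene biotypes, that would suggest they are Ensembl-specific gene models rather than duplicates of genes already present in org.Hs.eg.db.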

I therefore have two questions:

Why does EnsDb.Hsapiens.v86 keep so many gene SYMBOLs that seem superfluous, or that are just composite, assembly-specific versions of the very same gene?
Why, conversely, are so many genes, including protein-coding ones such as ACLS or AAVS1, not described in EnsDb.Hsapiens.v86?

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /softhpc/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /softhpc/R/4.0.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base

AnnotationDbi EnsDb.Hsapiens org.Hs.eg.db • 1.4k views

Updated 3.2 years ago by Johannes Rainer ★ 2.0k • written 3.3 years ago by bastien_chassagnol • 0

score 3 · Accepted Answer · 2021-01-08

You are taking gene location data from EMBL-EBI and gene annotation data from NCBI and wondering why they don't agree. They disagree because the two resources use different ways of inferring what is and isn't a gene, and where it might be. And the gene symbols come from HUGO, so you are adding a third player into the mix, with predictable results.

I'll make three points.

1. None of the data provided in any of the annotation packages you have been asking about is modified in any way by either the Bioconductor core team or, in the case of the EnsDb objects, Johannes Rainer, who has done an incredible amount of work to provide them. The packages are simply a convenient repackaging of existing data, and any question about why the data are the way they are is more or less off-topic for this site.
2. The EnsDb packages are based on Ensembl data (hence the name) and the OrgDb packages are based on NCBI data (hence the 'eg' in the name). There are any number of disagreements between these two annotation services as to what is a gene, where it might be, what transcripts are known to arise from the gene, etc. If you try to compare things between those datasets you will inevitably run into disagreements, which is entirely predictable, has nothing to do with the packages (see item 1), and isn't answerable on this forum.
3. There seems to be an idea among some people that the human genome is a static thing that we know everything about, in which case there shouldn't be these discrepancies. Nothing could be further from the truth. We are in the very beginning stages of our exploration of the genome, and the more we explore, the more confusing it all becomes.
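Because the symbols come from a third source, one option is to compare the two packages on Ensembl gene identifiers instead of symbols: org.Hs.eg.db exposes NCBI's mapping to Ensembl IDs under its `ENSEMBL` keytype, and EnsDb.Hsapiens.v86 exposes its own IDs under `GENEID`. The following is a minimal sketch under those assumptions; the helper name `overlap_summary` is hypothetical, and the database part only runs if both packages are installed. It does not make the disagreements go away, it only measures them on a shared identifier.

```r
# Hypothetical helper: count identifiers unique to each set and shared by both.
overlap_summary <- function(a, b) {
    a <- unique(a[!is.na(a)])
    b <- unique(b[!is.na(b)])
    c(
        only_a = length(setdiff(a, b)),
        shared = length(intersect(a, b)),
        only_b = length(setdiff(b, a))
    )
}

# Only query the real databases when both packages are available.
if (requireNamespace("EnsDb.Hsapiens.v86", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
    suppressPackageStartupMessages({
        library(EnsDb.Hsapiens.v86)
        library(org.Hs.eg.db)
    })
    # Ensembl gene IDs as released by Ensembl (a = EnsDb) ...
    ens_ids <- AnnotationDbi::keys(EnsDb.Hsapiens.v86, keytype = "GENEID")
    # ... and the Ensembl gene IDs that NCBI maps to its own genes (b = OrgDb).
    org_ids <- AnnotationDbi::keys(org.Hs.eg.db, keytype = "ENSEMBL")
    print(overlap_summary(ens_ids, org_ids))
}
```

Note that EnsDb.Hsapiens.v86 is pinned to Ensembl release 86 while org.Hs.eg.db follows the NCBI data current at its own build time, so part of any difference reflects release dates rather than a disagreement about biology.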