Question

Duplicated ENTREZID gene annotations org.Hs.eg.db


bastien_chassagnol • 0


I wanted to retrieve all relevant gene annotation for the human genome using the AnnotationDbi package together with the org.Hs.eg.db reference database, and I noticed the following seemingly odd behaviour.

library(org.Hs.eg.db)  # provides the org.Hs.eg.db annotation object

NCBI_gene_annotationDbi <- AnnotationDbi::select(org.Hs.eg.db, keys=AnnotationDbi::keys(org.Hs.eg.db, keytype="SYMBOL"), columns=c("SYMBOL", "GENENAME","ENTREZID"), keytype="SYMBOL")

# symbols that occur on more than one row of the result
duplicated_genes <- NCBI_gene_annotationDbi$SYMBOL[duplicated(NCBI_gene_annotationDbi$SYMBOL)]

# keep every row belonging to one of those symbols
NCBI_gene_annotationDbi_duplicated <- dplyr::filter(NCBI_gene_annotationDbi, SYMBOL %in% duplicated_genes)

[Image: table of the duplicated ENTREZID entries]

As you can see, only a small fraction of SYMBOL values (8 symbols out of 61538) have duplicated entries in the ENTREZID field. Moreover, for most of these, one entry gives a general description and annotation of the gene, while the others seem superfluous. For instance, HBD has two ENTREZID matches: a general one for hemoglobin subunit delta, and another very specific one that does not describe a function (bone disease).

With so few duplicated entries, would it be possible to clear these seemingly superfluous one-to-many matches, keeping only the most relevant one for each symbol? Or perhaps to merge both descriptions?

sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /softhpc/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /softhpc/R/4.0.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GO.db_3.12.1              hgu133plus2probe_2.18.0   hgu133plus2.db_3.2.3      org.Hs.eg.db_3.12.0      
 [5] hgu133plus2cdf_2.18.0     bmkanalysis_1.0.0         testthat_3.0.1            affy_1.68.0              
 [9] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.14.0          AnnotationFilter_1.14.0   GenomicFeatures_1.42.1   
[13] AnnotationDbi_1.52.0      Biobase_2.50.0            GenomicRanges_1.42.0      GenomeInfoDb_1.26.2      
[17] IRanges_2.24.1            S4Vectors_0.28.1          BiocGenerics_0.36.0

ENTREZID SYMBOL AnnotationDbi org.Hs.eg.db • 2.5k views

ADD COMMENT • link updated 3.3 years ago by Gordon Smyth 50k • written 3.3 years ago by bastien_chassagnol • 0

score 2 · Accepted Answer · 2021-01-07


James W. MacDonald 65k


United States

The OrgDb packages are simply a re-packaging of existing data we get from NCBI (and some other places). We make no attempt to do anything more than ensure that we get the current version of the files from the annotation source and that those data are parsed correctly and put into the underlying SQLite database. We don't have the ability to go through all the data we download and decide, on a case-by-case basis, whether or not there are superfluous matches.

Anyway, I wouldn't characterize much of what you present as superfluous matches. Although biologists tend to rely on HUGO gene symbols, and think they are in some sense unique, they aren't, as evidenced by your list. I would be hard pressed to think that 'hemoglobin subunit delta' and 'hypophosphatemic bone disease' (both HBD) are in any way the same thing.

Additionally, you should beware of using more than one column for a search. The underlying query is a SQL inner join, and the result can balloon quite rapidly when you have any one-to-many matches.
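To illustrate the ballooning effect, here is a minimal base-R sketch. It does not touch the real database: the two toy tables, their column names (gene_id, alias, term) and all values are made up, and merge() merely stands in for the SQL inner join.

```r
# Toy tables standing in for two one-to-many annotation columns.
# All identifiers and values here are invented for illustration.
aliases <- data.frame(
  gene_id = c("g1", "g1", "g2"),
  alias   = c("A1", "A2", "B1"),
  stringsAsFactors = FALSE
)
terms <- data.frame(
  gene_id = c("g1", "g1", "g1", "g2"),
  term    = c("term_x", "term_y", "term_z", "term_x"),
  stringsAsFactors = FALSE
)

# merge() with the default all = FALSE behaves like an inner join
# on the shared gene_id column.
joined <- merge(aliases, terms, by = "gene_id")

# g1 yields 2 aliases x 3 terms = 6 rows; g2 yields 1 x 1 = 1 row.
n_rows <- nrow(joined)
rows_per_gene <- table(joined$gene_id)
print(n_rows)
print(rows_per_gene)
```

Three input rows and four input rows become seven joined rows here, and each additional one-to-many column multiplies the count again.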

ADD COMMENT • link 3.3 years ago James W. MacDonald 65k


I see what you mean. However, only a tiny fraction of all SYMBOL values match more than one ENTREZID (precisely 8 symbols out of 61538, with 17 cases to deal with). Given that, I am strongly tempted to conclude either that some of the descriptions are redundant or, as you suggest, that they describe different biological entities and should therefore each have their own unique SYMBOL.

Taking the HBD example, I then checked the TxDb.Hsapiens.UCSC.hg19.knownGene and EnsDb.Hsapiens.v86 databases. Conclusion: only one gene is reported, the one with ENTREZID 3045. In that particular case, the hypophosphatemic bone disease entry seems to be erroneous, superfluous, or perhaps specific to a particular biological condition (it seems odd to describe a gene only by the disease it triggers, doesn't it?).

Anyway, do you know which channel I could use to report this possible mistake/redundancy directly to NCBI?

ADD REPLY • link 3.3 years ago bastien_chassagnol • 0


The latter is a GeneRIF, which is something I had no idea existed until like five minutes ago. It's intended to allow people to add functional annotation to genes? But somehow there doesn't seem to be a gene involved for the GeneRIF HBD. But there do seem to be lots of these things. And the examples they provide appear to be real, like you know, genes.

So, I don't know. You could probably talk to somebody at NCBI to see what's up. I don't have any insight as to who that might be, so if you really care it's on you to find out.

ADD REPLY • link 3.3 years ago James W. MacDonald 65k


Thanks for trying to enlighten me on the subject. If I learn anything new, I will get back to you.

ADD REPLY • link 3.3 years ago bastien_chassagnol • 0

score 2 · Accepted Answer · 2021-01-07


Gordon Smyth 50k


WEHI, Melbourne, Australia

The phenomenon you have discovered is well known. It's not a mistake, nor do most people find it a problem. You may find it inconvenient, but mammalian transcription is complicated. All gene annotation systems have incompatibilities and make unavoidable simplifications in an attempt to interpret an infinitely complex system within the framework of "genes". In this case you are comparing HUGO to Entrez. I suggest you try Ensembl, the other major system, then you might become more grateful for the simplicity and consistency of Entrez.

As I have remarked on this forum before, it is unfortunate that org.Hs.eg.db does not store the "Gene type" annotation provided by NCBI.

When I do expression analyses of human genes, I remove things that are not genes in the normal sense (Gene type=other or tRNA) and I restrict to reference chromosomes (removing mitochondria and unassembled scaffolds). You will find that the redundancies between symbols and Entrez IDs then disappear.
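A rough sketch of that kind of filtering in base R follows. It is an illustration under stated assumptions, not my actual pipeline: the toy table only mimics an NCBI gene_info file, the column names (GeneID, Symbol, type_of_gene, chromosome) are modelled on that file, and every row is invented.

```r
# Toy stand-in for an NCBI gene_info table; every row is invented.
gene_info <- data.frame(
  GeneID       = c("1", "2", "3", "4", "5"),
  Symbol       = c("GENEA", "GENEA", "GENEB", "GENEC", "GENED"),
  type_of_gene = c("protein-coding", "other", "tRNA", "ncRNA", "protein-coding"),
  chromosome   = c("11", "-", "6", "MT", "X"),
  stringsAsFactors = FALSE
)

# Reference chromosomes only: autosomes plus X and Y (assumed labels).
reference_chr <- c(as.character(1:22), "X", "Y")

# Drop gene types "other" and "tRNA", then drop anything that is not
# on a reference chromosome (mitochondrial, unplaced, unknown).
keep <- !(gene_info$type_of_gene %in% c("other", "tRNA")) &
  gene_info$chromosome %in% reference_chr
filtered <- gene_info[keep, ]

# Check whether any symbol still maps to more than one GeneID.
any_dup_before <- anyDuplicated(gene_info$Symbol) > 0
any_dup_after  <- anyDuplicated(filtered$Symbol) > 0
print(filtered)
```

In this toy table the duplicated symbol GENEA is resolved because its second entry has gene type "other"; whether every real duplicate disappears depends on the gene_info release you download.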

ADD COMMENT • link 3.3 years ago Gordon Smyth 50k


In a previously curated database, I found only 60 genes for which one HGNC symbol had two Ensembl entries, and only one, SPATA13, with three Ensembl entries: "ENSG00000182957", "ENSG00000228741", "ENSG00000273167". Based on this forum thread, https://www.biostars.org/p/16505/, and on these observations, I suppose that in the near future we will tend towards a one-to-one match between ENTREZID, HGNC and Ensembl entries, at least hopefully.

ADD REPLY • link 3.3 years ago bastien_chassagnol • 0


Many of the Ensembl gene symbols are not even recognized by HGNC.

IMO it is very unlikely and not even desirable that Entrez, Ensembl and HUGO will become 1-1 in the near future. My prediction is that they will converge in some respects but, as all the databases gradually try to reflect more and more of the complexity of mammalian transcription, that the number of disagreements and unique entries will continue to expand. The different databases have different priorities and so will continue to make compromises in somewhat different ways.

ADD REPLY • link 3.3 years ago Gordon Smyth 50k