Question

Weirdness with the human org.Hs.eg.db annotation database

Marco Blanchette ▴ 220

@marco-blanchette-5439

I stumbled on some oddities while using org.Hs.eg.db. Unless I am missing something about how to use the AnnotationDbi package, it seems that the Hs database has faulty entries in it.

There seems to be a one-to-many relationship between transcripts and genes, i.e. an individual transcript maps to more than one gene (which can't be). Here is how I came across that. Please correct me if I am making the wrong assumptions while using select:

library(org.Hs.eg.db)

k <- keys(org.Hs.eg.db,"ENSEMBLTRANS") ## Retrieving all the transcript ids from the db

t <- select(org.Hs.eg.db,k,"SYMBOL","ENSEMBLTRANS")  ## Retrieving the gene symbols associated with the transcript ids

f <- duplicated(t$ENSEMBLTRANS) ## Do I get duplicated transcript ids?

sum(f) ## Yes 13478

t[t$ENSEMBLTRANS %in% t$ENSEMBLTRANS[f][1],]  ## Show me one example

In this example, the transcript ENST00000331925 is associated with both ACTG1 (Actin Gamma 1) and ACTB (Actin Beta). In both NCBI and Ensembl, the transcript ENST00000331925 returns only a single human gene, ACTG1. I am not sure where ACTB got linked to it.
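One detail about the count above: `duplicated()` flags only the second and later occurrences of an id, so `sum(f)` counts surplus rows rather than distinct ambiguous transcripts. A minimal base R sketch of the difference follows. It assumes a hand-written toy data frame (`tx2sym`, with made-up ids such as `ENST_TOY_A`) standing in for the real `select()` result, so the numbers are illustrative only.

```r
## Toy stand-in for the data frame returned by select(); the rows are
## hand-written for illustration, not queried from org.Hs.eg.db.
tx2sym <- data.frame(
  ENSEMBLTRANS = c("ENST00000331925", "ENST00000331925", "ENST_TOY_A",
                   "ENST_TOY_B", "ENST_TOY_B", "ENST_TOY_B"),
  SYMBOL = c("ACTG1", "ACTB", "GENE1", "GENE2", "GENE3", "GENE4"),
  stringsAsFactors = FALSE
)

## duplicated() marks only the 2nd, 3rd, ... occurrence of each id,
## so this is the number of surplus rows.
extra_rows <- sum(duplicated(tx2sym$ENSEMBLTRANS))

## Number of distinct symbols attached to each transcript id.
n_symbols <- tapply(tx2sym$SYMBOL, tx2sym$ENSEMBLTRANS,
                    function(s) length(unique(s)))

## Transcript ids that map to more than one symbol.
ambiguous <- names(n_symbols)[n_symbols > 1]

## All rows for the ambiguous ids, including the first occurrence of each.
tx2sym[tx2sym$ENSEMBLTRANS %in% ambiguous, ]
```

On the toy table there are three surplus rows but only two ambiguous transcripts; the same gap may apply to the 13478 figure.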

Is that by design, or is it a bug in the db?

Thanks

annotationdbi homo sapiens • 2.4k views

ADD COMMENT • link updated 9.5 years ago by Marc Carlson ★ 7.2k • written 9.5 years ago by Marco Blanchette ▴ 220

This appears to be a recent phenomenon:

> select(org.Hs.eg.db, "ENST00000331925", c("SYMBOL","GENENAME"), "ENSEMBLTRANS")
     ENSEMBLTRANS SYMBOL       GENENAME
1 ENST00000331925  ACTG1 actin, gamma 1

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] org.Hs.eg.db_2.10.1  RSQLite_0.11.4       DBI_0.2-7           
[4] AnnotationDbi_1.24.0 Biobase_2.22.0       BiocGenerics_0.8.0  

loaded via a namespace (and not attached):
[1] IRanges_1.20.7 stats4_3.0.2

But in a more recent version of BioC:

> select(org.Hs.eg.db, "ENST00000331925", c("SYMBOL","GENENAME"), "ENSEMBLTRANS")
     ENSEMBLTRANS SYMBOL       GENENAME
1 ENST00000331925  ACTG1 actin, gamma 1
2 ENST00000331925   ACTB    actin, beta
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' resulted in 1:many mapping between keys and return rows
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] org.Hs.eg.db_2.14.0  RSQLite_0.11.4       DBI_0.3.1           
[4] AnnotationDbi_1.26.1 GenomeInfoDb_1.0.2   Biobase_2.24.0      
[7] BiocGenerics_0.10.0

loaded via a namespace (and not attached):
[1] IRanges_1.22.10 stats4_3.1.0

ADD REPLY • link 9.5 years ago James W. MacDonald 65k

score 0 · Answer 1 · 2014-10-29

I did some digging in the code that I use to make these org packages and traced the origin of this particular mapping back to Ensembl. We primarily use NCBI for our Entrez Gene based packages, but so many people need to use Ensembl data that we also supplement the mapping of our Ensembl IDs with data directly from Ensembl. (Ensembl is used for connecting Ensembl transcript, Ensembl protein and Ensembl gene IDs to the respective Entrez Gene IDs.) You can see for yourself how this would have happened with this code:

library(biomaRt)

## get the mart
mart = useMart('ensembl',dataset='hsapiens_gene_ensembl')

## get all of the data from ensembl for entrez gene ids to ensembl ids
egdata = getBM(c("ensembl_gene_id","entrezgene"), mart = mart)

## Now if you look at the data for id ENSG00000184009 
## You can see that the data matches two separate 
## (and different) entrez gene ids.
egdata[egdata$ensembl_gene_id=='ENSG00000184009',]
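To check how many Ensembl gene IDs behave this way, rather than looking at one ID, the same table can be scanned for every ID with more than one Entrez Gene ID. The sketch below is base R only and runs on a hand-written toy table (`egdata_toy`, with made-up IDs such as `ENSG_TOY_1`) in place of the live `getBM()` result. The two rows for ENSG00000184009 were typed in by hand (60 = ACTB, 71 = ACTG1), not fetched.

```r
## Toy stand-in for the getBM() result; rows are hand-written for
## illustration (60 = ACTB, 71 = ACTG1), not fetched from Ensembl.
egdata_toy <- data.frame(
  ensembl_gene_id = c("ENSG00000184009", "ENSG00000184009",
                      "ENSG_TOY_1", "ENSG_TOY_2"),
  entrezgene = c(71L, 60L, 1001L, NA),
  stringsAsFactors = FALSE
)

## Drop rows without an Entrez id, then remove exact duplicate pairs.
pairs <- unique(egdata_toy[!is.na(egdata_toy$entrezgene), ])

## Count distinct Entrez ids per Ensembl gene id.
n_entrez <- table(pairs$ensembl_gene_id)

## Ensembl gene ids linked to more than one Entrez id.
multi <- names(n_entrez)[n_entrez > 1]
pairs[pairs$ensembl_gene_id %in% multi, ]
```

Swapping `egdata_toy` for the real `egdata` should give the full list. Depending on the Ensembl release, the Entrez attribute may be named differently (I believe newer releases call it `entrezgene_id`), so check `listAttributes(mart)` first.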

So that pointed to data coming from Ensembl as the source of this strange result. But Ensembl is a highly reliable data source (this is why we use them to help build our Ensembl to Entrez Gene ID mappings). So I contacted them to ask what was happening and they pointed me here. And if you scroll down you can see that there are in fact Entrez Gene IDs for both ACTB and ACTG1. So how did that happen?

Thomas Maurel patiently explained the following to me in correspondence when I asked him about it. I felt his explanation was very good (and also that he deserves credit for doing this part of the investigation), so I am re-posting that part of his response here:

"Our cross reference mapping system is quite complex but as a general rule an Ensembl Gene, Transcript or Translation ID can be linked to multiple external ids from the same source.

All the EntrezGene ids are imported into Ensembl via RefSeq mappings. For this example, we see that all RefSeq mappings we have for this gene (via transcripts and translations) correspond to ACTB, apart from one.

The predicted transcript, XM_006722048.1, aligns against one of the transcripts and corresponds to ACTG1, according to NCBI annotation.

This can also be verified using the website:
http://www.ensembl.org/Homo_sapiens/Share/2b2a0c24d24821539cade73652314960162085327

If you look at the bottom, all the mapped RefSeq sequences, you can see which gene they correspond to when clicking on the individual links.

XM_006722048.1 aligns correctly against the Ensembl transcript ENST00000573283, and only this one.

It seems that the HGNC name agrees with the name we would get via the predicted sequence, and not the curated RefSeq entries. The curated RefSeq entry is mapped via overlap. On the following link
http://www.ensembl.org/Homo_sapiens/Share/95e7c9b5184b4971ae83e302e7b088b4162085327, you can see that 5 RefSeq entries, all corresponding to ACTB, overlap our Ensembl gene.

To conclude, until the various resources (here, HGNC and RefSeq), agree on a same name, we can only do our best and display all the information we have available."

Anyhow, that seems like a pretty complete explanation of the current circumstance to me. I hope that you are satisfied with it. But if not, you now know who to contact (NCBI's RefSeq resources) about the apparent discrepancy.
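In the meantime, downstream code has to decide what to do with the 1:many rows. Here is a minimal base R sketch of two common choices: collapsing all symbols into one string per transcript, or dropping ambiguous transcripts entirely. It assumes a hand-written toy table (`res`, with a made-up id `ENST_TOY_A`) in place of the real `select()` output, and neither option is the one correct answer.

```r
## Toy stand-in for a select() result with a 1:many key (hand-written rows).
res <- data.frame(
  ENSEMBLTRANS = c("ENST00000331925", "ENST00000331925", "ENST_TOY_A"),
  SYMBOL = c("ACTG1", "ACTB", "GENE1"),
  stringsAsFactors = FALSE
)

## Option 1: keep every symbol, one row per transcript, joined with ";".
collapsed <- aggregate(SYMBOL ~ ENSEMBLTRANS, data = res,
                       FUN = function(s) paste(sort(unique(s)), collapse = ";"))

## Option 2: drop every transcript that maps to more than one symbol.
## unique() first, so repeated identical rows do not count as ambiguity.
counts <- table(unique(res)$ENSEMBLTRANS)
unambiguous <- res[res$ENSEMBLTRANS %in% names(counts)[counts == 1], ]
```

Newer AnnotationDbi releases also offer `mapIds()` with a `multiVals` argument for the same purpose. Check `?mapIds` in your installed version before relying on it.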

Marc