Question

Problem mapping Ensembl protein IDs to Entrez IDs using org.Hs.eg.db

Marcus Aurelius • 0

@marcus-aurelius-12445

I am using org.Hs.eg.db to map Ensembl protein IDs (from Stringdb) to Entrez IDs. I used to be able to map almost all of the Ensembl IDs in Stringdb, but for some reason I no longer can. For example:

> unlist(mget(c("ENSP00000000233"), map, ifnotfound=NA))
ENSP00000000233 
             NA

When I look up the same ID on ensembl.org (http://www.ensembl.org/Homo_sapiens/Search/Results?q=ENSP00000000233;site=ensembl;facet_species=Human), I find that it maps to ARF5. Is there a fix for this, or is there a bug in the new version of org.Hs.eg.db?

org.Hs.eg.db entrez gene identifiers ensemble • 1.8k views

ADD COMMENT • link updated 7.2 years ago by James W. MacDonald 65k • written 7.2 years ago by Marcus Aurelius • 0

score 0 · Answer 1 · 2017-02-25

James W. MacDonald 65k

@james-w-macdonald-5106

United States

If you do the corresponding query in the Biomart interface at ensembl.org, you get nothing returned, so this is an issue with the Biomart server, not the biomaRt package.

ADD COMMENT • link 7.2 years ago James W. MacDonald 65k

@JamesW.McDonald If you look at the protein aliases in the Stringdb network (you can download the protein.aliases.v10.txt.gz file from the Stringdb website), you can see that ENSP00000000233 maps to "ADP-RIBOSYLATION FACTOR 5 [*103188]", which is ARF5.

ADD REPLY • link 7.1 years ago Poincare • 0

That may well be true. But I think there is a fundamental misunderstanding here. We are simply packaging the data that exist in a couple of databases in a way that makes them easier to use. In the case of the org.Hs.eg.db package, the central database used is the Gene database from NCBI, so the data are necessarily Entrez Gene ID-centric.

While there are quite a few mappings from Entrez Gene -> Ensembl IDs, it's not unusual for the two annotation groups to disagree, so it's not surprising that an annotation database built around NCBI IDs is not completely comprehensive when mapping IDs from a different annotation group.

You will find the exact same issues if you use biomaRt or one of the EnsDb packages to map Ensembl protein IDs to Entrez Gene IDs. There are lots of gaps. The best advice is to stay within the annotation group from which you got your IDs. So if you have Ensembl-based IDs, use biomaRt or the EnsDb packages to do the mapping. If you have UCSC or NCBI IDs, then use the OrgDb or TxDb packages that the Bioconductor core team supplies.
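
As a rough sketch of how to see which Ensembl protein IDs an NCBI-centric package can and cannot map, something like the following could be used. The helper `split_mapped` is hypothetical (not part of any package), and the lookup assumes the `ENSEMBLPROT` keytype of org.Hs.eg.db; it is skipped entirely if the packages are not installed.

```r
# Hypothetical helper (not from any package): split a named vector of
# mapping results into the IDs that mapped and the IDs that did not.
split_mapped <- function(mapping) {
  unmapped <- is.na(mapping)
  list(mapped = mapping[!unmapped],
       unmapped = names(mapping)[unmapped])
}

ids <- c("ENSP00000000233")

# Only attempt the lookup if the annotation packages are available.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE) &&
    requireNamespace("AnnotationDbi", quietly = TRUE)) {
  db <- org.Hs.eg.db::org.Hs.eg.db

  # Restrict to IDs the package knows about, because mapIds() stops with
  # an error when none of the supplied keys are valid for the keytype.
  known <- intersect(ids, AnnotationDbi::keys(db, keytype = "ENSEMBLPROT"))

  # Start with every ID unmapped, then fill in the ones that resolve.
  mapping <- stats::setNames(rep(NA_character_, length(ids)), ids)
  if (length(known) > 0) {
    mapping[known] <- AnnotationDbi::mapIds(db, keys = known,
                                            keytype = "ENSEMBLPROT",
                                            column = "ENTREZID",
                                            multiVals = "first")
  }

  print(split_mapped(mapping))
}
```

Any IDs listed under `unmapped` are the ones better looked up with an Ensembl-based resource such as biomaRt or an EnsDb package.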

ADD REPLY • link 7.1 years ago James W. MacDonald 65k