Problem mapping Ensembl protein IDs to Entrez IDs using org.Hs.eg.db
1
0
Entering edit mode
@marcus-aurelius-12445

I am using org.Hs.eg.db to map Ensembl protein IDs (from STRINGdb) to Entrez IDs. I used to be able to map almost all of the Ensembl IDs in STRINGdb, but for some reason I no longer can. For example:

> unlist(mget(c("ENSP00000000233"), map, ifnotfound=NA))
ENSP00000000233 
             NA 

And when I look up the ID on ensembl.org (http://www.ensembl.org/Homo_sapiens/Search/Results?q=ENSP00000000233;site=ensembl;facet_species=Human), I see that it maps to ARF5. Is there a fix for this, or is there a bug somewhere in the new version of org.Hs.eg.db?

@james-w-macdonald-5106

If you run the corresponding query in the BioMart interface at ensembl.org, nothing is returned, so this is an issue with the BioMart server, not the biomaRt package.


@JamesW.McDonald If you look at the protein aliases in the STRINGdb network (you can download the protein.aliases.v10.txt.gz file from the STRING website), you can see that ENSP00000000233 maps to "ADP-RIBOSYLATION FACTOR 5 [*103188]", which is ARF5.


That may well be true. But I think there is a fundamental misunderstanding here. We are simply packaging the data that exist in a couple of databases in a way that makes it easier to use. In the case of the org.Hs.eg.db package, the central database used is the Gene DB from NCBI, so the data are necessarily Entrez Gene ID-centric.

While there are quite a few mappings from Entrez Gene -> Ensembl IDs, it's not unusual for there to be disagreements between the two annotation groups, and so it's not that unexpected that an annotation database built using NCBI IDs would not be completely comprehensive when trying to map IDs from a different annotation group.
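To see how large the gap is for your own ID list, you can compare it against the Ensembl protein keys that org.Hs.eg.db actually contains. This is only a rough sketch: `coverage_report` is a hypothetical helper written for this post, not part of any package, and the last part only runs if org.Hs.eg.db and AnnotationDbi are installed.

```r
# Hypothetical helper (not from any package): split query IDs into those
# present in a set of known keys and those that are missing.
coverage_report <- function(ids, known_keys) {
  found <- ids %in% known_keys
  list(
    mapped = ids[found],
    unmapped = ids[!found],
    fraction_mapped = mean(found)
  )
}

# The ID from the question; replace with your full STRING ID vector.
ids <- c("ENSP00000000233")

# Only query the annotation package if it is actually installed.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE) &&
    requireNamespace("AnnotationDbi", quietly = TRUE)) {
  # All Ensembl protein IDs that this version of org.Hs.eg.db knows about.
  known <- AnnotationDbi::keys(org.Hs.eg.db::org.Hs.eg.db,
                               keytype = "ENSEMBLPROT")
  report <- coverage_report(ids, known)
  print(report$unmapped)
  print(report$fraction_mapped)
}
```

Checking against `keys()` first also avoids passing IDs that the package does not contain to `select()` or `mapIds()`.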

You will find exactly the same issues if you use biomaRt or one of the EnsDb packages to map Ensembl protein IDs to Entrez Gene IDs. There are lots of gaps. The best advice is to stay within the annotation group from which you got your IDs. So if you have Ensembl-based IDs, use biomaRt or the EnsDb packages to do the mapping. If you have UCSC or NCBI IDs, then use the OrgDb or TxDb packages that the Bioconductor core team supplies.
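As an illustration of staying within Ensembl, something along these lines should look up the protein ID in an EnsDb package. Treat it as a sketch under assumptions: I picked EnsDb.Hsapiens.v86 only as an example release (it may not match the Ensembl version your STRING download is based on), and the ENTREZID column can still be empty for some proteins.

```r
# The ID from the question; replace with your own vector of Ensembl protein IDs.
ids <- c("ENSP00000000233")

# Assumption: EnsDb.Hsapiens.v86 is installed. Any EnsDb package that
# carries protein annotations could be substituted here.
if (requireNamespace("ensembldb", quietly = TRUE) &&
    requireNamespace("EnsDb.Hsapiens.v86", quietly = TRUE)) {
  edb <- EnsDb.Hsapiens.v86::EnsDb.Hsapiens.v86

  # Not every EnsDb includes protein tables, so check before querying.
  if (ensembldb::hasProteinData(edb)) {
    res <- AnnotationDbi::select(
      edb,
      keys = ids,
      keytype = "PROTEINID",
      columns = c("PROTEINID", "GENEID", "GENENAME", "ENTREZID")
    )
    print(res)
  }
}
```

Going through GENEID or GENENAME gives you an Ensembl-native identifier to work with even when no Entrez Gene ID is recorded for a protein.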
