I am using org.Hs.eg.db to map Ensembl protein IDs (from STRINGdb) to Entrez Gene IDs. I used to be able to map almost all of the Ensembl IDs in STRINGdb, but for some reason I no longer can. For example:
> unlist(mget(c("ENSP00000000233"), map, ifnotfound=NA))
ENSP00000000233
             NA
But when I look up the same ID on ensembl.org (http://www.ensembl.org/Homo_sapiens/Search/Results?q=ENSP00000000233;site=ensembl;facet_species=Human), it maps to ARF5. Is there a fix for this, or is there a bug in the new version of org.Hs.eg.db?
@JamesW.McDonald If you look at the protein aliases for the STRINGdb network (you can download the protein.aliases.v10.txt.gz file from the STRING website), you can see that ENSP00000000233 maps to "ADP-RIBOSYLATION FACTOR 5 [*103188]", which is ARF5.
That may well be true. But I think there is a fundamental misunderstanding here. We are simply packaging the data that exist in a couple of databases in a way that makes it easier to use. In the case of the org.Hs.eg.db package, the central database used is the Gene DB from NCBI, so the data are necessarily Entrez Gene ID-centric.
While there are quite a few mappings from Entrez Gene -> Ensembl IDs, it's not unusual for there to be disagreements between the two annotation groups, and so it's not that unexpected that an annotation database built using NCBI IDs would not be completely comprehensive when trying to map IDs from a different annotation group.
You will find exactly the same issues if you use biomaRt or one of the EnsDb packages to map Ensembl protein IDs to Entrez Gene IDs. There are lots of gaps. The best advice is to stay within the annotation group from which you got your IDs. So if you have Ensembl-based IDs, use biomaRt or the EnsDb packages to do the mapping. If you have UCSC or NCBI IDs, then use the OrgDb or TxDb packages that the Bioconductor core team supplies.
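To make that advice concrete, here is a minimal sketch of the two routes for the ID from the question. The function names map_ensp_via_orgdb and map_ensp_via_biomart are hypothetical helpers made up for this illustration. It assumes org.Hs.eg.db, AnnotationDbi and biomaRt are installed, and the biomaRt route needs network access. The results depend on the package and Ensembl release you have, so no output is shown.

```r
# Sketch only: helper names are hypothetical, not part of any package.
ids <- c("ENSP00000000233")

# NCBI-centric route: look the Ensembl protein ID up in org.Hs.eg.db.
map_ensp_via_orgdb <- function(ids) {
  db <- org.Hs.eg.db::org.Hs.eg.db
  # select() raises an error when none of the keys are valid,
  # so keep only the IDs that the package actually knows about.
  known <- ids[ids %in% AnnotationDbi::keys(db, keytype = "ENSEMBLPROT")]
  if (length(known) == 0L) {
    return(data.frame(ENSEMBLPROT = character(),
                      ENTREZID = character(),
                      SYMBOL = character()))
  }
  AnnotationDbi::select(db, keys = known, keytype = "ENSEMBLPROT",
                        columns = c("ENTREZID", "SYMBOL"))
}

# Ensembl-centric route: ask Ensembl's BioMart (requires network access).
map_ensp_via_biomart <- function(ids) {
  mart <- biomaRt::useEnsembl(biomart = "genes",
                              dataset = "hsapiens_gene_ensembl")
  biomaRt::getBM(
    attributes = c("ensembl_peptide_id", "entrezgene_id", "hgnc_symbol"),
    filters = "ensembl_peptide_id",
    values = ids,
    mart = mart
  )
}

# Usage (uncomment to run):
# orgdb_hits <- map_ensp_via_orgdb(ids)
# mart_hits  <- map_ensp_via_biomart(ids)
# setdiff(ids, orgdb_hits$ENSEMBLPROT)  # IDs missing from the NCBI-centric package
```

Comparing the two results shows which IDs fall into the gaps described above. Neither route is guaranteed to return an Entrez Gene ID for every Ensembl protein ID.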