Hi Marc and others,
Recently a funny entry popped up in org.Hs.eg.db during an internal mapping check:
library(org.Hs.eg.db)
select(org.Hs.eg.db, keys='16592263', columns=c('SYMBOL','ENTREZID'), keytype ='ENSEMBL')
#    ENSEMBL   SYMBOL  ENTREZID
#1  16592263 RPL21P28 100131205
select(org.Hs.eg.db, keys='100131205', columns=c('SYMBOL','ENSEMBL'))
#   ENTREZID   SYMBOL         ENSEMBL
#1 100131205 RPL21P28        16592263
#2 100131205 RPL21P28 ENSG00000220749
#3 100131205 RPL21P28 ENSG00000213860 # only in 3.0.0, removed now
grep('ENSG', keys(org.Hs.eg.db, keytype = 'ENSEMBL'), invert=T, value=T)
#16592263
I guess this is more of an upstream problem at NCBI (http://www.ncbi.nlm.nih.gov/gene/100131205, fixed now), which is where this 16592263 is coming from. So I wonder whether it is possible to enforce some check on the Ensembl ID (is it true that all Ensembl gene IDs start with ENS? Scratching my head now). The good thing is that this is quite a unique case, and removing it is easy and loses nothing. The entry is present in at least the 2 versions I have checked (org.Hs.eg.db_3.0.0, org.Hs.eg.db_3.1.2), so it has been there for a while.
We are reporting this to NCBI as well. Some others who are affected by the same entry:
http://www.genome.jp/dbget-bin/www_bget?hsa:100131205
http://www.broadinstitute.org/~atsankov/ChIP_data/superEnhancers/SE5k_H3K4me3_h64/Ann.out
Best,
Dan
Hi Marc,
Fair enough. I agree that forcing a removal is kind of crossing the line here; this should be NCBI's job. However, I would still suggest at least issuing a warning when certain keys returned by keys(x, keytype='sometype') clearly violate the naming convention.
I meant to include a small patch for AnnotationDbi in devel here, but it seems not that simple to patch, so forget it.
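Short of patching AnnotationDbi, the warning idea can be tried at user level. The following is only a minimal sketch: `checked_keys` is a hypothetical helper (not part of AnnotationDbi), the pattern is whatever the caller believes the convention to be, and `fake_keys` stands in for a real call such as `keys(org.Hs.eg.db, keytype = "ENSEMBL")` so that the example runs without the annotation package.

```r
# Hypothetical user-level wrapper; NOT part of AnnotationDbi.
# `key_fun` is any zero-argument function returning a character vector,
# e.g. function() keys(org.Hs.eg.db, keytype = "ENSEMBL").
# `pattern` is the caller's assumption about the naming convention.
checked_keys <- function(key_fun, pattern) {
  k <- key_fun()
  bad <- k[!grepl(pattern, k)]
  if (length(bad) > 0) {
    warning(length(bad), " key(s) do not match '", pattern, "': ",
            paste(head(bad, 5), collapse = ", "), call. = FALSE)
  }
  k  # keys are returned unchanged; nothing is removed
}

# Stand-in for the real keys() call, using IDs quoted in this thread.
fake_keys <- function() c("ENSG00000220749", "16592263")

# Show the warning text as a message instead of letting it propagate.
res <- withCallingHandlers(
  checked_keys(fake_keys, "^ENSG"),
  warning = function(w) {
    message("caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

The wrapper only reports; it deliberately leaves the keys untouched, in line with the point that removal should be NCBI's job.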
Best,
Dan
Hi Dan
...but be careful:
Ensembl identifiers do not always start with 'ENS', especially in those species where the annotation is imported from other resources, e.g. C. elegans:
> library(org.Ce.eg.db)
> select(org.Ce.eg.db, keys='180445', columns=c('SYMBOL','ENSEMBL'), keytype ='ENTREZID')
  ENTREZID SYMBOL        ENSEMBL
1   180445  sms-2 WBGene00004893
see also: http://www.ensembl.org/Caenorhabditis_elegans/Gene/Summary?db=core;g=WBGene00004893
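To make the pitfall concrete, here is a small sketch using only IDs quoted in this thread. It shows that a naive 'ENS' prefix test also flags the valid worm ID. The all-digits test next to it is just an assumed, looser heuristic for illustration, not an established rule, and other species may use yet other ID formats.

```r
# IDs taken from this thread; the names are labels added for illustration.
ids <- c(human = "ENSG00000220749",
         worm  = "WBGene00004893",
         stray = "16592263")

# Naive check: anything not starting with "ENS" is flagged.
naive_flagged <- names(ids)[!grepl("^ENS", ids)]

# Assumed looser heuristic: flag only IDs that consist of digits alone.
digit_flagged <- names(ids)[grepl("^[0-9]+$", ids)]

print(naive_flagged)  # includes the valid worm ID
print(digit_flagged)
```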
Regards, Hans-Rudolf
Hi Hans-Rudolf,
Thanks for the example. I thought this might be the case; obviously I have not worked with worms before. Good to know, so there is no simple way to check this then.
Case closed.
Best,
Dan