Question: Faulty entry in, a force check perhaps?
3.6 years ago by
Dan Du210
Dan Du210 wrote:

Hi Marc and others,

Recently a funny entry popped up in during some internal mapping checks:

select(, keys='16592263', columns=c('SYMBOL','ENTREZID'), keytype ='ENSEMBL')
#1 16592263 RPL21P28 100131205
select(, keys='100131205', columns=c('SYMBOL','ENSEMBL'))
#1 100131205 RPL21P28        16592263
#2 100131205 RPL21P28 ENSG00000220749 
#3 100131205 RPL21P28 ENSG00000213860 # only in 3.0.0, removed now
grep('ENSG', keys(, keytype = 'ENSEMBL'), invert=T, value=T)

I guess this is more of an upstream problem from NCBI (, fixed now), which is where this 16592263 comes from. So I wonder whether it is possible to enforce some check on the Ensembl ID. (Is it true that all Ensembl gene IDs start with ENS? Scratching my head now.) The good thing is that this is quite a unique case, and removing it is easy and loses nothing. The entry is present in at least 2 versions (, I have checked), so it has been there for a while.

We are reporting this to NCBI as well... some more affected people...



modified 3.6 years ago • written 3.6 years ago by Dan Du210
3.6 years ago by
Marc Carlson7.2k
United States
Marc Carlson7.2k wrote:

You are correct that the data is just coming to us like that from NCBI.

You are also correct that I *could* easily filter out all IDs that don't start with 'ENS'. And if I did that, it would only remove that single goofy entry:

ensk <- keys(, keytype ='ENSEMBL')
table(grepl('^ENS', ensk))  # '^' anchors the match to the start of the ID

But do you really want us 'pre-cleaning' this data for you? What if your research project was to study the accuracy of NCBI data? If so, I think we would have ruined it for you by filtering it. So as much as possible, I think we should try to give you the data in a form that is as close to the original as we can.
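For anyone who does want the odd entry gone, the filtering can be done on the user's side after the lookup. The following is only a minimal sketch in base R: the data frame is a toy stand-in built from the output shown in the question, and `drop_odd_ensembl` is a hypothetical helper name, not part of AnnotationDbi.

```r
# Toy stand-in for a data frame returned by select(); values are copied
# from the output shown in the question.
res <- data.frame(
  ENTREZID = c("100131205", "100131205"),
  SYMBOL   = c("RPL21P28", "RPL21P28"),
  ENSEMBL  = c("16592263", "ENSG00000220749"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose ENSEMBL value starts with `prefix`.
# Rows with a missing ENSEMBL value are kept, because NA means
# "no mapping", not "malformed ID".
drop_odd_ensembl <- function(df, prefix = "ENS") {
  ok <- is.na(df$ENSEMBL) | startsWith(df$ENSEMBL, prefix)
  df[ok, , drop = FALSE]
}

clean <- drop_odd_ensembl(res)
print(clean)
```

Keeping this step in the user's own script leaves the packaged data untouched, so the raw NCBI mapping is still available to anyone who needs it.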

Also: please feel free to write to NCBI and ask them about problems that they may have with their data.



modified 3.6 years ago • written 3.6 years ago by Marc Carlson7.2k

Hi Marc,

Fair enough. I agree that forcing a removal would be crossing the line here; this should be NCBI's job. However, I would still suggest at least issuing a warning when keys returned by keys(x, keytype='sometype') clearly violate the naming convention. Here is a small patch for AnnotationDbi in devel.


--- methods-geneCentricDbs.R    (revision 103184)
+++ methods-geneCentricDbs.R    (working copy)
@@ -1296,8 +1296,11 @@
         keytype <- .chooseCentralOrgPkgSymbol(x)
-      smartKeys(x=x, keytype=keytype, ..., FUN=.keys)
-  }
+      res<-smartKeys(x=x, keytype=keytype, ..., FUN=.keys)
+      if(grepl('ENSEMBL', keytype) & any(!grepl('ENS', res))) {
+        warning('Some keys returned may not be valid of type ', keytype)
+      }
+    }


#seems not that simple to patch, forget it
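The same idea can be tried without touching AnnotationDbi by wrapping the key vector on the user's side. This is a rough sketch only: `check_keys` is a hypothetical helper, the `pattern` argument is an assumption supplied by the caller, and the key vector is a toy stand-in for what keys() would return.

```r
# Hypothetical wrapper: warn about keys that do not match `pattern`,
# but return the keys unchanged (nothing is filtered).
check_keys <- function(k, keytype, pattern) {
  bad <- k[!grepl(pattern, k)]
  if (length(bad) > 0) {
    warning(sprintf("%d key(s) of type %s do not match '%s': %s",
                    length(bad), keytype, pattern,
                    paste(head(bad, 5), collapse = ", ")),
            call. = FALSE)
  }
  k
}

# Toy stand-in for the ENSEMBL keys of an annotation package.
ensk <- c("ENSG00000220749", "16592263", "ENSG00000213860")

# Capture the warning text so it can be inspected, then carry on.
msg <- NULL
checked <- withCallingHandlers(
  check_keys(ensk, "ENSEMBL", "^ENS"),
  warning = function(w) {
    msg <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)
print(msg)
```

Because the wrapper only warns, it keeps the "data as delivered" behaviour argued for above while still flagging entries such as 16592263.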



modified 3.6 years ago • written 3.6 years ago by Dan Du210

Hi Dan


...but be careful:

Ensembl identifiers do not always start with 'ENS', especially in species where the annotation is imported from other resources, e.g. C. elegans:

> library(

> select(, keys='180445', columns=c('SYMBOL','ENSEMBL'), keytype ='ENTREZID')
1   180445  sms-2 WBGene00004893


see also:;g=WBGene00004893


Regards, Hans-Rudolf


written 3.6 years ago by Hotz, Hans-Rudolf400

Hi Hans-Rudolf,

Thanks for the example. I suspected this might be the case, but I have obviously not worked with worms before. Good to know; so there is no simple way to check this, then.
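A single prefix cannot cover every organism, but a per-organism pattern table is one way a check could be approximated. The sketch below is illustrative only: the two patterns are assumptions based on the IDs seen in this thread (human ENSG IDs and WormBase WBGene IDs), and `id_patterns` and `looks_valid` are hypothetical names.

```r
# Assumed patterns, for illustration only; other organisms need their own.
id_patterns <- c(
  human     = "^ENSG[0-9]{11}$",
  c_elegans = "^WBGene[0-9]{8}$"
)

# Hypothetical helper: TRUE where an ID matches the pattern registered
# for `organism`; stops if no pattern is registered.
looks_valid <- function(ids, organism) {
  if (!organism %in% names(id_patterns)) {
    stop("no pattern registered for organism: ", organism)
  }
  grepl(id_patterns[[organism]], ids)
}

human_ok <- looks_valid(c("ENSG00000220749", "16592263"), "human")
worm_ok  <- looks_valid("WBGene00004893", "c_elegans")
print(human_ok)
print(worm_ok)
```

Any such table has to be maintained by hand, which is probably why no generic check exists.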

Case closed.



written 3.6 years ago by Dan Du210