Question: Faulty entry in org.Hs.eg.db, a force check perhaps?
Dan Du (Germany) wrote, 2.6 years ago:

Hi Marc and others,

Recently a funny entry popped up in org.Hs.eg.db during an internal mapping check:

library(org.Hs.eg.db)
select(org.Hs.eg.db, keys='16592263', columns=c('SYMBOL','ENTREZID'), keytype ='ENSEMBL')
#   ENSEMBL   SYMBOL  ENTREZID
#1 16592263 RPL21P28 100131205
select(org.Hs.eg.db, keys='100131205', columns=c('SYMBOL','ENSEMBL'))
#   ENTREZID   SYMBOL         ENSEMBL
#1 100131205 RPL21P28        16592263
#2 100131205 RPL21P28 ENSG00000220749 
#3 100131205 RPL21P28 ENSG00000213860 # only in 3.0.0, removed now
grep('ENSG', keys(org.Hs.eg.db, keytype = 'ENSEMBL'), invert=TRUE, value=TRUE)
#16592263

I guess this is more of an upstream problem at NCBI (http://www.ncbi.nlm.nih.gov/gene/100131205, fixed now), which is where this 16592263 is coming from. So I wonder if it is possible to enforce some check on the Ensembl ID (is it true that all Ensembl gene IDs start with ENS? Scratching my head now). The good thing is that this is quite a unique case, and removing it is easy while nothing is lost. This holds for at least the 2 versions I have checked (org.Hs.eg.db_3.0.0, org.Hs.eg.db_3.1.2), so it has been there for a while.

We are reporting this to NCBI as well. Some other resources that appear to be affected:

http://www.genome.jp/dbget-bin/www_bget?hsa:100131205

http://www.broadinstitute.org/~atsankov/ChIP_data/superEnhancers/SE5k_H3K4me3_h64/Ann.out

Best,

Dan

Marc Carlson (United States) wrote, 2.6 years ago:

You are correct that the data is just coming to us like that from NCBI. 

http://www.ncbi.nlm.nih.gov/gene/?term=100131205

You are also correct that I *could* easily filter out all IDs that don't start with 'ENS'.  And if I did that, it would only remove that single goofy entry...

library(org.Hs.eg.db)
ensk <- keys(org.Hs.eg.db, keytype ='ENSEMBL')
table(grepl('^ENS', ensk))  # '^' anchors the match to the start of the ID

But do you really want us 'pre-cleaning' this data for you?  What if your research project was to study the accuracy of NCBI data?  If so, I think we could have ruined it for you by filtering.  So as much as possible, I think we should try to give you the data in a form that is as close to the original as we can.
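
If you do want the filtered view, it is easy to get on your own side. A minimal sketch, using a small stand-in vector (the IDs quoted in this thread) in place of the real keys(org.Hs.eg.db, keytype='ENSEMBL') result, so it runs without the package; the variable names are only for illustration:

```r
# Stand-in for keys(org.Hs.eg.db, keytype = 'ENSEMBL'); the values are the
# ones quoted in this thread, not the full key set.
ensk <- c('16592263', 'ENSG00000220749', 'ENSG00000213860')

# Anchor the pattern so 'ENS' must be a prefix, not just appear somewhere.
is_ens <- grepl('^ENS', ensk)

suspicious <- ensk[!is_ens]  # keys to inspect or report upstream
cleaned    <- ensk[is_ens]   # keys that follow the 'ENS' prefix convention

print(suspicious)
```

Keeping the suspicious keys in their own vector, rather than dropping them silently, makes it easy to report them to NCBI.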

Also: please feel free to write to NCBI and ask them about problems that they may have with their data.

 

 Marc


Hi Marc,

Fair enough. I agree that forcing a removal is kind of crossing the line here; this should be the job of NCBI. However, I would still suggest at least issuing a warning when keys returned by keys(x, keytype='sometype') clearly violate the naming convention. Here is a small patch for AnnotationDbi in devel.

===================================================================
--- methods-geneCentricDbs.R    (revision 103184)
+++ methods-geneCentricDbs.R    (working copy)
@@ -1296,8 +1296,11 @@
       if(missing(keytype)){
         keytype <- .chooseCentralOrgPkgSymbol(x)
       }
-      smartKeys(x=x, keytype=keytype, ..., FUN=.keys)
-  }
+      res<-smartKeys(x=x, keytype=keytype, ..., FUN=.keys)
+      if(grepl('ENSEMBL', keytype) & any(!grepl('ENS', res))) {
+        warning('Some keys returned may not be valid of type ', keytype)
+      }  
+    }
 ) 

#seems not that simple to patch, forget it

Best,

Dan


Hi Dan

 

...but be careful:

Ensembl identifiers do not always start with 'ENS', especially in species where the annotation is imported from other resources, e.g. C. elegans:

> library(org.Ce.eg.db)
> select(org.Ce.eg.db, keys='180445', columns=c('SYMBOL','ENSEMBL'), keytype ='ENTREZID')
  ENTREZID SYMBOL        ENSEMBL
1   180445  sms-2 WBGene00004893

 

see also: http://www.ensembl.org/Caenorhabditis_elegans/Gene/Summary?db=core;g=WBGene00004893

 

Regards, Hans-Rudolf

 


Hi Hans-Rudolf,

Thanks for the example. I thought this might be the case; obviously I have not worked with worms before. Good to know, so there is no simple way to check this then.
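
A single prefix will not do, but a per-species pattern could still work on the user side. A rough sketch, where both the lookup table expected_pattern and the helper flag_odd_keys are hypothetical (neither is part of AnnotationDbi), and where the two patterns are assumptions based only on the IDs seen in this thread:

```r
# Hypothetical lookup of expected ENSEMBL key patterns per annotation package.
# Only the two cases from this thread are listed; this is not exhaustive.
expected_pattern <- c(
  org.Hs.eg.db = '^ENS',
  org.Ce.eg.db = '^WBGene'
)

# Hypothetical helper: return the keys that do not match the pattern for 'pkg'
# and warn if there are any. Packages without a known pattern are not checked.
flag_odd_keys <- function(keys, pkg) {
  pattern <- expected_pattern[pkg]
  if (is.na(pattern)) return(character(0))
  odd <- keys[!grepl(pattern, keys)]
  if (length(odd) > 0) {
    warning(length(odd), ' key(s) do not look like ENSEMBL gene IDs for ', pkg)
  }
  odd
}

# The human example from this thread is flagged; the worm ID is not.
hs_odd <- suppressWarnings(
  flag_odd_keys(c('16592263', 'ENSG00000220749'), 'org.Hs.eg.db')
)
ce_odd <- flag_odd_keys('WBGene00004893', 'org.Ce.eg.db')
```

In practice the keys would come from keys(x, keytype='ENSEMBL'), and the table would need one entry per organism package, which is exactly the maintenance burden that makes a built-in check unattractive.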

Case closed.

Best,

Dan
