Faulty entry in org.Hs.eg.db: enforce a check, perhaps?
Dan Du ▴ 210
@dan-du-5270
Germany

Hi Marc and others,

Recently a funny entry popped up in org.Hs.eg.db during an internal mapping check:

library(org.Hs.eg.db)
select(org.Hs.eg.db, keys='16592263', columns=c('SYMBOL','ENTREZID'), keytype ='ENSEMBL')
#   ENSEMBL   SYMBOL  ENTREZID
#1 16592263 RPL21P28 100131205
select(org.Hs.eg.db, keys='100131205', columns=c('SYMBOL','ENSEMBL'))
#   ENTREZID   SYMBOL         ENSEMBL
#1 100131205 RPL21P28        16592263
#2 100131205 RPL21P28 ENSG00000220749 
#3 100131205 RPL21P28 ENSG00000213860 # only in 3.0.0, removed now
grep('ENSG', keys(org.Hs.eg.db, keytype = 'ENSEMBL'), invert=TRUE, value=TRUE)
#16592263

I guess this is more of an upstream problem at NCBI (http://www.ncbi.nlm.nih.gov/gene/100131205, fixed now), which is where this 16592263 comes from. So I wonder whether it is possible to enforce some check on the Ensembl ID. (Is it true that all Ensembl gene IDs start with ENS? I am scratching my head now.) The good thing is that this is a unique case, and removing it is easy and loses nothing. The entry is present in at least the 2 versions I have checked (org.Hs.eg.db_3.0.0, org.Hs.eg.db_3.1.2), so it has been there for a while.
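A minimal sketch of the kind of check meant here, run on a toy data frame shaped like the select() output above rather than on the real database. It assumes that human Ensembl gene IDs start with 'ENSG'; the names res, looks_ok, suspicious and cleaned are made up for illustration.

```r
# Toy data frame mimicking the shape of the select() output above.
# The stray value is the one reported in this post.
res <- data.frame(
  ENTREZID = c("100131205", "100131205"),
  SYMBOL   = c("RPL21P28", "RPL21P28"),
  ENSEMBL  = c("16592263", "ENSG00000220749"),
  stringsAsFactors = FALSE
)

# Assumption: a valid human Ensembl gene ID starts with "ENSG".
looks_ok <- grepl("^ENSG", res$ENSEMBL)

suspicious <- res[!looks_ok, , drop = FALSE]  # rows to inspect or report upstream
cleaned    <- res[looks_ok, , drop = FALSE]   # rows kept for downstream use
```

Keeping the suspicious rows in their own data frame, instead of silently dropping them, makes it easy to see what was filtered.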

We are reporting this to NCBI as well. Some other resources appear to be affected too:

http://www.genome.jp/dbget-bin/www_bget?hsa:100131205

http://www.broadinstitute.org/~atsankov/ChIP_data/superEnhancers/SE5k_H3K4me3_h64/Ann.out

Best,

Dan

org.Hs.eg.db • 1.8k views
Marc Carlson ★ 7.2k
@marc-carlson-2264
United States

You are correct that the data is just coming to us like that from NCBI. 

http://www.ncbi.nlm.nih.gov/gene/?term=100131205

You are also correct that I *could* easily filter out all IDs that don't start with 'ENS'. And if I did that, it would only remove that single goofy entry:

library(org.Hs.eg.db)
ensk <- keys(org.Hs.eg.db, keytype ='ENSEMBL')
table(grepl('^ENS', ensk))  # anchor with ^ so only IDs that *start* with ENS match

But do you really want us 'pre-cleaning' this data for you? What if your research project were to study the accuracy of NCBI data? In that case, I think we could have ruined it for you by filtering. So as far as possible, I think we should try to give you the data in a form that is as close to the original as we can.

Also: please feel free to write to NCBI and ask them about problems that they may have with their data.

 

Marc


Hi Marc,

Fair enough. I agree that forcing a removal would be crossing the line here; this should be NCBI's job. However, I would still suggest at least issuing a warning when keys returned by keys(x, keytype='sometype') clearly violate the naming convention. Here is a small patch for AnnotationDbi in devel.

===================================================================
--- methods-geneCentricDbs.R    (revision 103184)
+++ methods-geneCentricDbs.R    (working copy)
@@ -1296,8 +1296,11 @@
       if(missing(keytype)){
         keytype <- .chooseCentralOrgPkgSymbol(x)
       }
-      smartKeys(x=x, keytype=keytype, ..., FUN=.keys)
-  }
+      res<-smartKeys(x=x, keytype=keytype, ..., FUN=.keys)
+      if(grepl('ENSEMBL', keytype) & any(!grepl('ENS', res))) {
+        warning('Some keys returned may not be valid of type ', keytype)
+      }
+    }
 )

#seems not that simple to patch, forget it
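For reference, the same idea as a stand-alone function that works on any character vector of keys, so it can be tried without touching AnnotationDbi. This is only a sketch: check_keys and its pattern argument are hypothetical names, not part of any package. As far as one can tell from the diff, the if block becomes the last expression of the patched function, so the keys themselves would no longer be returned; the sketch returns them explicitly.

```r
# Hypothetical helper, not part of AnnotationDbi: warn about keys that do
# not match the expected prefix, then return the keys unchanged.
check_keys <- function(res, keytype, pattern = "^ENS") {
  # Only check Ensembl-like keytypes (ENSEMBL, ENSEMBLPROT, ENSEMBLTRANS).
  if (grepl("ENSEMBL", keytype, fixed = TRUE)) {
    odd <- res[!grepl(pattern, res)]
    if (length(odd) > 0) {
      warning(length(odd), " key(s) may not be valid for keytype ", keytype,
              ", e.g. ", odd[1], call. = FALSE)
    }
  }
  res  # always return the keys, whether or not a warning was raised
}

# Intended use, assuming org.Hs.eg.db is installed and loaded:
# ensk <- check_keys(keys(org.Hs.eg.db, keytype = "ENSEMBL"), "ENSEMBL")
```

The data are left untouched; the caller only gets a hint that something looks off.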

Best,

Dan


Hi Dan,

...but be careful:

Ensembl identifiers do not always start with 'ENS', especially in species where the annotation is imported from other resources, e.g. C. elegans:

> library(org.Ce.eg.db)
> select(org.Ce.eg.db, keys='180445', columns=c('SYMBOL','ENSEMBL'), keytype ='ENTREZID')
  ENTREZID SYMBOL        ENSEMBL
1   180445  sms-2 WBGene00004893

 

see also: http://www.ensembl.org/Caenorhabditis_elegans/Gene/Summary?db=core;g=WBGene00004893
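If a check is still wanted, it would have to be species-aware. A rough sketch with a hand-maintained lookup table; the table ensembl_gene_pattern and the function odd_ensembl_keys are hypothetical, and the two prefixes are only the ones seen in this thread, so other species would need their own entries.

```r
# Assumed prefixes for the ENSEMBL column, keyed by annotation package name.
# Hand-maintained and incomplete; extend or correct as needed.
ensembl_gene_pattern <- c(
  org.Hs.eg.db = "^ENSG",    # human
  org.Ce.eg.db = "^WBGene"   # C. elegans, IDs imported from WormBase
)

# Return the IDs that do not match the pattern registered for the package.
odd_ensembl_keys <- function(ids, pkg) {
  pattern <- ensembl_gene_pattern[[pkg]]  # errors for an unregistered package
  ids[!grepl(pattern, ids)]
}

odd_human <- odd_ensembl_keys(c("16592263", "ENSG00000220749"), "org.Hs.eg.db")
odd_worm  <- odd_ensembl_keys("WBGene00004893", "org.Ce.eg.db")
```

With these inputs only the stray human entry is flagged, and the worm ID passes.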

 

Regards, Hans-Rudolf

 


Hi Hans-Rudolf,

Thanks for the example. I suspected this might be the case, but I obviously have not worked with worms before. Good to know; so there is no simple way to check this after all.

Case closed.

Best,

Dan
