Question: Faulty entry in org.Hs.eg.db, a force check perhaps?
3.6 years ago by
Dan Du210
Germany
Dan Du210 wrote:

Hi Marc and others,

Recently an odd entry popped up in org.Hs.eg.db during an internal mapping check:

library(org.Hs.eg.db)
select(org.Hs.eg.db, keys='16592263', columns=c('SYMBOL','ENTREZID'), keytype ='ENSEMBL')
#   ENSEMBL   SYMBOL  ENTREZID
#1 16592263 RPL21P28 100131205
select(org.Hs.eg.db, keys='100131205', columns=c('SYMBOL','ENSEMBL'))
#   ENTREZID   SYMBOL         ENSEMBL
#1 100131205 RPL21P28        16592263
#2 100131205 RPL21P28 ENSG00000220749
#3 100131205 RPL21P28 ENSG00000213860 # only in 3.0.0, removed now
grep('ENSG', keys(org.Hs.eg.db, keytype = 'ENSEMBL'), invert=TRUE, value=TRUE)
#16592263

I guess this is an upstream problem at NCBI (http://www.ncbi.nlm.nih.gov/gene/100131205, fixed now), which is where this 16592263 comes from. So I wonder whether it is possible to enforce some check on the Ensembl ID. (Is it true that all Ensembl gene IDs start with ENS? I am scratching my head now.) The good thing is that this is a unique case, and removing it is easy and loses nothing. It is present in at least the two versions I have checked (org.Hs.eg.db_3.0.0 and org.Hs.eg.db_3.1.2), so it has been there for a while.
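For anyone who needs to drop the entry on their own side, here is a minimal sketch in base R. It assumes the mapping has already been pulled into a data frame shaped like the select() output above; the toy data frame and the names map, is_ensg and map_clean are illustrative only, and the "^ENSG" pattern is an assumption that holds for the human keys seen here.

```r
# Toy copy of the select() output shown above, so no annotation
# package is needed to run this sketch.
map <- data.frame(
  ENTREZID = c("100131205", "100131205"),
  SYMBOL   = c("RPL21P28", "RPL21P28"),
  ENSEMBL  = c("16592263", "ENSG00000220749"),
  stringsAsFactors = FALSE
)

# Assumption: valid human Ensembl gene IDs start with "ENSG".
is_ensg <- grepl("^ENSG", map$ENSEMBL)

# Keep only the rows that pass the check; drop = FALSE keeps a data frame.
map_clean <- map[is_ensg, , drop = FALSE]
map_clean$ENSEMBL
```

The rows that fail the check are still available as map[!is_ensg, ], which is handy for reporting them upstream.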

We are reporting this to NCBI as well. Others are affected by it too:

http://www.genome.jp/dbget-bin/www_bget?hsa:100131205

Best,

Dan

modified 3.6 years ago • written 3.6 years ago by Dan Du210
3.6 years ago by
Marc Carlson7.2k
United States
Marc Carlson7.2k wrote:

You are correct that the data is just coming to us like that from NCBI.

http://www.ncbi.nlm.nih.gov/gene/?term=100131205

You are also correct that I *could* easily filter out all IDs that don't start with 'ENS'.  And if I did that, it would remove only that single goofy entry:

library(org.Hs.eg.db)
ensk <- keys(org.Hs.eg.db, keytype ='ENSEMBL')
table(grepl('^ENS', ensk))

But do you really want us 'pre-cleaning' this data for you?  What if your research project was to study the accuracy of NCBI data?  If so, I think we could have ruined it for you by filtering.  So as much as possible, I think we should try to give you the data in a form that is as close to the original as we can.

Also: please feel free to write to NCBI and ask them about problems that they may have with their data.

Marc

Hi Marc,

Fair enough, I agree that forcing a removal is crossing the line here; this should be NCBI's job. However, I would still suggest at least issuing a warning when some keys returned by keys(x, keytype='sometype') clearly violate the naming convention. Here is a small patch for AnnotationDbi in devel.

===================================================================
--- methods-geneCentricDbs.R    (revision 103184)
+++ methods-geneCentricDbs.R    (working copy)
@@ -1296,8 +1296,11 @@
       if(missing(keytype)){
         keytype <- .chooseCentralOrgPkgSymbol(x)
       }
-      smartKeys(x=x, keytype=keytype, ..., FUN=.keys)
-  }
+      res<-smartKeys(x=x, keytype=keytype, ..., FUN=.keys)
+      if(grepl('ENSEMBL', keytype) & any(!grepl('ENS', res))) {
+        warning('Some keys returned may not be valid of type ', keytype)
+      }
+    }
 )

#seems not that simple to patch, forget it
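The same idea can be tried on the user side without touching AnnotationDbi. The sketch below is a hypothetical helper (warn_odd_keys is not part of any package) that takes a character vector of keys, warns about the ones not matching an assumed pattern, and returns the keys unchanged. The toy vector stands in for the result of keys(org.Hs.eg.db, keytype = 'ENSEMBL').

```r
# Hypothetical helper, not part of AnnotationDbi.
# Warns about keys that do not match `pattern`, then returns all keys as-is.
warn_odd_keys <- function(keys, keytype, pattern = "^ENS") {
  if (identical(keytype, "ENSEMBL")) {
    odd <- keys[!grepl(pattern, keys)]
    if (length(odd) > 0) {
      warning(length(odd), " key(s) may not be valid ", keytype, " IDs: ",
              paste(head(odd, 5), collapse = ", "))
    }
  }
  keys  # nothing is filtered; the data stays as delivered
}

# Toy input standing in for keys(org.Hs.eg.db, keytype = "ENSEMBL").
ensk <- c("ENSG00000220749", "16592263", "ENSG00000213860")

# Capture the warning text instead of letting it print.
msg <- tryCatch(
  { warn_odd_keys(ensk, "ENSEMBL"); NA_character_ },
  warning = function(w) conditionMessage(w)
)
msg
```

Because the helper only warns and never drops anything, it stays in line with the point that the package should not pre-clean the data.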

Best,

Dan

Hi Dan

...but be careful:

Ensembl identifiers do not always start with 'ENS', especially in species where the annotation is imported from other resources, e.g. C. elegans:

> library(org.Ce.eg.db)
> select(org.Ce.eg.db, keys='180445', columns=c('SYMBOL','ENSEMBL'), keytype ='ENTREZID')
  ENTREZID SYMBOL        ENSEMBL
1   180445  sms-2 WBGene00004893

Regards, Hans-Rudolf
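To illustrate the caveat: any prefix check would have to be organism-specific. The sketch below uses a hand-maintained lookup containing only the two patterns seen in this thread; the names id_patterns and odd_keys are hypothetical, and the patterns themselves are assumptions rather than an official specification of either ID scheme.

```r
# Hypothetical lookup: one assumed ID pattern per annotation package.
# Only the two ID styles that appear in this thread are listed.
id_patterns <- c(
  "org.Hs.eg.db" = "^ENSG[0-9]+$",    # e.g. ENSG00000220749
  "org.Ce.eg.db" = "^WBGene[0-9]+$"   # e.g. WBGene00004893
)

# Return the keys that do not match the pattern assumed for `pkg`.
odd_keys <- function(keys, pkg) {
  if (!pkg %in% names(id_patterns)) {
    stop("no pattern registered for ", pkg)
  }
  keys[!grepl(id_patterns[[pkg]], keys)]
}

odd_keys(c("16592263", "ENSG00000220749"), "org.Hs.eg.db")
odd_keys("WBGene00004893", "org.Ce.eg.db")
```

Every additional organism needs its own entry, which is exactly why a single built-in check is hard to get right.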

Hi Hans-Rudolf,

Thanks for the example. I thought this might be the case; I obviously have not worked with worms before. Good to know, so there is no simple way to check this then.

Case closed.

Best,

Dan
