Question

org.Hs.eg.db gives more than one ENTREZID for a gene symbol


harish • 0


I have a list of gene symbols, and when I run the R code below, two of them, TEC and MEMO1, each return two different Entrez IDs. This makes my output list longer than my input list. How can I resolve this, and can a gene symbol really have two Entrez IDs?

library(org.Hs.eg.db)

HGNC_symbol <- c("TEC", "MEMO1")

conversion <- AnnotationDbi::select(org.Hs.eg.db, 
       keys = HGNC_symbol,
       columns = c("ENTREZID", "SYMBOL"),
       keytype = "SYMBOL")

org.Hs.eg.db AnnotationDbi • 1.6k views

written 18 months ago by harish • updated 18 months ago by James W. MacDonald


OK, I understand the problem now. When I look up the Entrez IDs from the output in NCBI, the two IDs for each symbol point to two different genes that share the same gene symbol. However, one is approved by HGNC and the other is not. How can I tell AnnotationDbi to treat my gene symbols as the HGNC-approved ones when I retrieve the Entrez IDs? This is clearest if you look up MEMO1 in NCBI.

Bottom line: is there a way to specify that gene symbols are HGNC gene symbols in AnnotationDbi?

reply 18 months ago by harish

score 1 · Answer 1 · 2023-01-23

There is no way to specify the source of gene symbols for an OrgDb. For TEC, one comes from HGNC, and the other comes from OMIM. When we generate the OrgDb packages, we don't distinguish between sources, as they are all (as far as NCBI is concerned) 'real' gene symbols. Unfortunately, gene symbols are not unique, and come from different sources (and get retired regularly), so one would ideally not use them for anything but presenting data to a biologist, for whom the gene symbol is usually the primary ID.

The easy way to get around this is to use mapIds instead.

> z <- mapIds(org.Hs.eg.db, c("TEC", "MEMO1"), "ENTREZID","SYMBOL")
'select()' returned 1:many mapping between keys and columns
> data.frame(ENTREZID = z, SYMBOL = names(z))
      ENTREZID SYMBOL
TEC       7006    TEC
MEMO1     7795  MEMO1

But do note this is a naive implementation that simply chooses the first match for each symbol. You can see all of the matches by using multiVals = "list":

> mapIds(org.Hs.eg.db, c("TEC", "MEMO1"), "ENTREZID","SYMBOL", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
$TEC
[1] "7006"      "100124696"

$MEMO1
[1] "7795"  "51072"
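
Since taking the first match silently hides the ambiguity, it can help to flag which symbols had more than one Entrez ID before trusting the result. The following is only a minimal sketch in base R, not part of AnnotationDbi: the data frame is typed in by hand from the output above and stands in for what select() would return, and the names n_ids, ambiguous and first_hit are made up for illustration.

```r
# Mapping table copied by hand from the output shown above. In practice this
# would be the data frame returned by AnnotationDbi::select() (assumption:
# same column names, SYMBOL and ENTREZID).
conversion <- data.frame(
  SYMBOL   = c("TEC", "TEC", "MEMO1", "MEMO1"),
  ENTREZID = c("7006", "100124696", "7795", "51072"),
  stringsAsFactors = FALSE
)

# Count how many distinct Entrez IDs each symbol maps to.
n_ids <- tapply(conversion$ENTREZID, conversion$SYMBOL,
                function(ids) length(unique(ids)))

# Symbols with more than one Entrez ID need a manual look.
ambiguous <- names(n_ids)[n_ids > 1]

# Keep the first row per symbol (the same "first match" rule described above)
# and record whether that choice was ambiguous.
first_hit <- conversion[!duplicated(conversion$SYMBOL), ]
first_hit$ambiguous <- first_hit$SYMBOL %in% ambiguous

print(first_hit)
```

This keeps the output the same length as the input while making it obvious which rows were a guess. If you would rather not guess at all, mapIds also accepts multiVals = "filter" (drop keys with multiple matches) and multiVals = "asNA" (return NA for them).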