Question

ENSEMBL records not complete in org.Mm.eg.db?

Ming Wang • 0

@ming-wang-8281

United States

When I convert gene IDs from ENSEMBL to SYMBOL or ENTREZID using the org.Mm.eg.db package, most of the genes fail to map.

So I checked the package org.Mm.eg.db:

It turns out that there are 71927 ENTREZID records in total, but only 33045 ENSEMBL records.

Only 45.7% of the ENTREZID keys could be converted to ENSEMBL.

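library(org.Mm.eg.db)
orgdb <- org.Mm.eg.db

# Number of distinct ENSEMBL gene IDs in the package
g1 <- keys(orgdb, keytype = "ENSEMBL")
length(g1)

# Number of distinct ENTREZID keys in the package
g2 <- keys(orgdb, keytype = "ENTREZID")
length(g2)

# Map every ENTREZID to ENSEMBL, keeping the first match when there are several
g2e <- mapIds(orgdb, keys = g2, column = "ENSEMBL", keytype = "ENTREZID", multiVals = "first")
g2e <- g2e[!is.na(g2e)]
length(g2e)

# Fraction of ENTREZID keys with at least one ENSEMBL ID
length(g2e) / length(g2)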

Updated 4.7 years ago by Kevin Blighe • written 4.7 years ago by Ming Wang

score 0 · Answer 1 · 2021-05-16

Hi,

You should not expect a complete mapping of IDs across different 'key types', with key types in this case representing different annotation databases such as Ensembl, MGI symbols, RefSeq / Entrez, VEGA, etc.

Each annotation database has different rules about what to include. There are many postings on this all across the World Wide Web, for example:

As the analyst (I assume), you can set rules about what to do with these 'unmapped' IDs. Most will be predicted genes that were found to have negligible expression in some experiments. Keep in mind, also, that there are many thousands of processed and unprocessed pseudogenes in the genome.
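To illustrate what 'setting rules' could look like, here is a minimal sketch of two common options: dropping unmapped IDs, or keeping them and falling back to the original ID as the label. The vector `mapped` only mimics the named character vector that `mapIds()` returns with `multiVals = "first"`; its contents, the third ID in particular, are illustrative placeholders, and the helper names `drop_unmapped` and `fallback_to_id` are hypothetical, not part of any package.

```r
# Stand-in for the output of mapIds(); in real use this might come from e.g.
#   mapped <- mapIds(org.Mm.eg.db, keys = ens_ids, column = "SYMBOL",
#                    keytype = "ENSEMBL", multiVals = "first")
# where `ens_ids` is your own vector of Ensembl gene IDs (assumed name).
mapped <- c(
  ENSMUSG00000000001 = "Gnai3",
  ENSMUSG00000000028 = "Cdc45",
  ENSMUSG99999999999 = NA  # hypothetical ID with no symbol in the package
)

# Rule 1: drop every ID that has no mapping.
drop_unmapped <- function(mapped) {
  mapped[!is.na(mapped)]
}

# Rule 2: keep every ID, using the original ID as the label when unmapped.
fallback_to_id <- function(mapped) {
  labels <- ifelse(is.na(mapped), names(mapped), mapped)
  names(labels) <- names(mapped)
  labels
}

kept <- drop_unmapped(mapped)
labels <- fallback_to_id(mapped)

# Fraction of input IDs without a mapping, worth reporting either way.
unmapped_fraction <- mean(is.na(mapped))
```

Which rule is appropriate depends on the downstream analysis: dropping keeps labels clean, whereas the fallback keeps every row of the original data.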

Kevin