Question

BioMart missing IDs

rina ▴ 30

Looking at the NAs that came up after mapping Ensembl IDs to Entrez IDs using BioMart, I checked one at random (ENSG00000018607) and found that it is linked to an Entrez ID, yet BioMart did not return it. Any ideas what the reason might be?

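This is the code I used:

library(biomaRt)

# Connect to the Ensembl human gene dataset
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))

# Map Ensembl gene IDs to Entrez IDs
# Note: recent Ensembl releases name this attribute "entrezgene_id" instead of "entrezgene"
genes.entrez <- getBM(
  filters    = "ensembl_gene_id",
  attributes = c("ensembl_gene_id", "entrezgene"),
  values     = genes.nodot,
  mart       = mart)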

Note that I originally had a data frame of raw expression counts keyed by versioned Ensembl IDs of the form

 [1] "ENSG00000000005.5"  "ENSG00000000419.11" "ENSG00000000457.12" "ENSG00000000460.15" "ENSG00000000938.11" "ENSG00000000971.14" "ENSG00000001036.12" "ENSG00000001084.9" 
[9] "ENSG00000001167.13"

So I removed the dot suffix to do the mapping.
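For reference, the suffix removal can be done with a single regular expression. This is a minimal sketch in base R; `genes.versioned` is a hypothetical name for the original vector, while `genes.nodot` matches the name used in the `getBM()` call above.

```r
# Versioned Ensembl gene IDs, taken from the example above.
genes.versioned <- c("ENSG00000000005.5", "ENSG00000000419.11", "ENSG00000000457.12")

# Drop the trailing ".<version>" so the IDs match BioMart's ensembl_gene_id filter.
genes.nodot <- sub("\\.[0-9]+$", "", genes.versioned)

# Sanity check: distinct versioned IDs should stay distinct once the suffix is gone.
stopifnot(!anyDuplicated(genes.nodot))
```

Anchoring the pattern with `$` means only a trailing numeric version is removed, so an ID without a suffix passes through unchanged.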

The results I get after the mapping look like this.

  ensembl_gene_id entrezgene
1 ENSG00000000005      64102
2 ENSG00000001561      22875
3 ENSG00000004478       2288
4 ENSG00000004799       5166
5 ENSG00000005022        292
6 ENSG00000005073       3207
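To list exactly which IDs ended up without an Entrez ID, it may help to compare the query vector against the result. The following is a minimal sketch in base R that uses a small made-up stand-in for the `getBM()` output; the toy values and the names `not.returned`, `returned.na`, and `unmapped` are assumptions for illustration, not real query results.

```r
# Toy stand-ins for the query vector and the getBM() result.
# Which IDs are missing or NA here is invented for the example.
genes.nodot <- c("ENSG00000000005", "ENSG00000000419",
                 "ENSG00000001561", "ENSG00000018607")
genes.entrez <- data.frame(
  ensembl_gene_id = c("ENSG00000000005", "ENSG00000001561", "ENSG00000018607"),
  entrezgene      = c(64102, 22875, NA),
  stringsAsFactors = FALSE
)

# IDs that were queried but are absent from the result altogether.
not.returned <- setdiff(genes.nodot, genes.entrez$ensembl_gene_id)

# IDs that came back, but with no Entrez ID attached.
returned.na <- genes.entrez$ensembl_gene_id[is.na(genes.entrez$entrezgene)]

# Everything without a usable Entrez mapping.
unmapped <- union(not.returned, returned.na)
print(unmapped)
```

Keeping the two groups separate can be useful, since an ID that is absent from the result and an ID that is present with `NA` may be missing for different reasons.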

Any help would be much appreciated, as I am pretty new to R.

biomart ensembl entrez gene identifiers

updated 6.7 years ago by James W. MacDonald 68k • written 6.7 years ago by rina ▴ 30

score 2 · Answer 1 · 2018-08-02

Please note that the ID you picked isn't a good example. According to Ensembl, ENSG00000018607 is an unprocessed pseudogene. Entrez Gene, on the other hand, says it's a coding gene. These are not the same thing! So I think it's good that BioMart isn't saying they are. This is actually spelled out on the Ensembl page, which says that there is an overlapping gene in Entrez Gene but that the two groups differ as to what the underlying thing is.
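One way to check whether other unmapped IDs fall into the same category is to tabulate their Ensembl biotype against whether an Entrez ID came back. The sketch below uses a toy table; in practice the biotype column could be requested as an extra attribute in `getBM()`, assuming an attribute such as `gene_biotype` is offered by the mart being queried. The name `annot` and the two rows are illustrative only.

```r
# Toy annotation table: one mapped coding gene and the pseudogene discussed above.
annot <- data.frame(
  ensembl_gene_id = c("ENSG00000000005", "ENSG00000018607"),
  entrezgene      = c(64102, NA),
  gene_biotype    = c("protein_coding", "unprocessed_pseudogene"),
  stringsAsFactors = FALSE
)

# Cross-tabulate biotype against whether an Entrez ID was returned.
biotype.by.mapping <- table(
  biotype    = annot$gene_biotype,
  has.entrez = !is.na(annot$entrezgene)
)
print(biotype.by.mapping)
```

If most of the `NA` rows turn out to be pseudogenes or other non-coding biotypes, that would suggest the missing IDs reflect annotation differences rather than a problem with the query.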

Mapping between the two annotation services is a fraught enterprise, and I try to avoid doing so if at all possible, because there are any number of little technical details like this and it's not clear who is right. I mean, you have two groups with lots of people who spend lots of time trying to figure this stuff out, and they disagree over the fundamental issue of whether this thing is a pseudogene that doesn't get expressed or a real gene that codes for a protein. And that's just one gene (or not). There is no way for one person to resolve these conflicts, particularly in bulk, using programmatic methods. So you should either simply accept whatever mappings you get, or stick with a single annotation service and be clear about which one you used.