Question: BioMart missing IDs
8 months ago by
rina0 wrote:

After mapping Ensembl IDs to Entrez IDs with BioMart, some IDs came back as NA. I checked one of them at random (ENSG00000018607), and it does appear to be linked to an Entrez ID, yet the mapping did not return it. Any idea what the reason might be?

This is the code I used

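    library(biomaRt)
    mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
    genes.entrez <- getBM(
      attributes = c("ensembl_gene_id", "entrezgene"),
      # The rest of the call was cut off in the post; the arguments below are an assumed completion.
      filters = "ensembl_gene_id", values = ensembl.ids, mart = mart)  # ensembl.ids: assumed name for the ID vector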

Note that I originally had a data frame of raw expression counts keyed by versioned Ensembl IDs of the form

[1] "ENSG00000000005.5"  "ENSG00000000419.11" "ENSG00000000457.12" "ENSG00000000460.15" "ENSG00000000938.11" "ENSG00000000971.14" "ENSG00000001036.12" "ENSG00000001084.9"
[9] "ENSG00000001167.13"

So I removed the dot suffix to do the mapping.
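For readers following along, removing the version suffix could look roughly like the sketch below. It uses only base R; the variable names `versioned.ids` and `ensembl.ids` are assumptions made for illustration and are not taken from the original post.

```r
# Example IDs copied from the post; the variable names are assumed.
versioned.ids <- c("ENSG00000000005.5", "ENSG00000000419.11", "ENSG00000000457.12")

# Drop a trailing ".<digits>" version suffix; IDs without one are left unchanged.
ensembl.ids <- sub("\\.[0-9]+$", "", versioned.ids)

print(ensembl.ids)
```

Anchoring the pattern at the end of the string means only the final version component is removed, so an ID that has no suffix passes through untouched.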

The results I get after the mapping look like this.

  ensembl_gene_id entrezgene
1 ENSG00000000005      64102
2 ENSG00000001561      22875
3 ENSG00000004478       2288
4 ENSG00000004799       5166
5 ENSG00000005022        292
6 ENSG00000005073       3207

Every kind of help would be much appreciated, as I am pretty new to using R.

Answer: BioMart missing IDs
8 months ago by
United States
James W. MacDonald49k wrote:

Please note that the example you give isn't a good one. According to Ensembl, ENSG00000018607 is an unprocessed pseudogene. Entrez Gene, on the other hand, says it's a coding gene. These are not the same thing! So I think it's good that BioMart isn't saying they are. This is actually spelled out on the Ensembl page, which says that there is an overlapping gene in Entrez Gene but that the two groups differ as to what the underlying thing is.

Mapping between the two annotation services is a fraught enterprise, and I try to avoid it if at all possible, because there are any number of little technical details like this and it's not clear who is right. I mean, you have two groups with lots of people who spend lots of time trying to figure this stuff out, and they disagree over the fundamental issue of whether this thing is a pseudogene that doesn't get expressed or a real gene that codes for a protein. And that's just one gene (or not). There is no way for one person to resolve these conflicts, particularly in bulk, using programmatic methods. So you should either simply accept whatever mappings you get, or stick with a single annotation service and be clear about which one you used.
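If you do accept the mappings as returned, one way to be clear about it is to list the IDs that ended up without an Entrez ID and report them alongside the results. The following is only a sketch in base R under stated assumptions: the data frame is a small stand-in for a `getBM()` result with the two columns shown in the question, `queried.ids` is a hypothetical name for the vector that was submitted, and `ENSG99999999999` is a made-up placeholder for an ID that is not returned at all.

```r
# Toy stand-in for a getBM() result (columns as in the question).
genes.entrez <- data.frame(
  ensembl_gene_id = c("ENSG00000000005", "ENSG00000001561", "ENSG00000018607"),
  entrezgene      = c(64102, 22875, NA),
  stringsAsFactors = FALSE
)

# Hypothetical query vector; the last ID is a made-up placeholder.
queried.ids <- c(genes.entrez$ensembl_gene_id, "ENSG99999999999")

# IDs that came back with no Entrez ID.
no.entrez <- genes.entrez$ensembl_gene_id[is.na(genes.entrez$entrezgene)]

# IDs that were queried but are absent from the result.
not.returned <- setdiff(queried.ids, genes.entrez$ensembl_gene_id)

# Everything without a usable mapping, for reporting.
unmapped <- union(no.entrez, not.returned)
print(unmapped)
```

Keeping the two cases separate (returned with `NA` versus not returned) can help when deciding whether an ID is simply unmapped or was not recognised by the query in the first place.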


That was a great and really helpful answer! I will take your pointers into consideration and try to tweak the workflow in a way that I avoid converting IDs. Much appreciated! 

Reply written 8 months ago by rina