Question: Biomart ensembl annotation does not completely match my gene names
0
2.1 years ago by
spr0
spr0 wrote:

I am using getBM to get start/end positions for my genes, however biomart ensembl hgnz_names only contain 19869 out of my 29581 genes. In my dataset, gene names are HUGO, too. It seems that around 10,000 of my genes are not included in the total of 36,713 genes in biomart ensembl. I checked to see if there is a potential naming difference, but I couldn't find anything incriminating. Does anyone have an idea why that happens? Thanks!

biomart ensemblbiomart • 980 views
modified 2.1 years ago by Hervé Pagès ♦♦ 14k • written 2.1 years ago by spr0

Hi,

A few things that might help understand what's going on:

What's "biomart ensembl hgnz_names"? I see attributes hgnc_idhgnc_symbol, and hgnc_trans_name in the hsapiens_gene_ensembl dataset of the ENSEMBL_MART_ENSEMBL mart. Is this what you are talking about?

Where did you get your 29581 HUGO genes from?

How do you count 36,713 genes in "biomart ensembl". I see 63,967 genes there:

library(biomaRt)
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
dataset="hsapiens_gene_ensembl")
bm_result0 <- getBM("ensembl_gene_id", mart=mart)
#   ensembl_gene_id
# 1 ENSG00000000003
# 2 ENSG00000000005
# 3 ENSG00000000419
# 4 ENSG00000000457
# 5 ENSG00000000460
# 6 ENSG00000000938
length(unique(bm_result0$ensembl_gene_id)) # [1] 63967  Maybe you meant that only 36,713 HUGO genes are actually mapped to the Ensembl genes? bm_result <- getBM(c("ensembl_gene_id", "hgnc_id"), mart=mart) head(bm_result) # ensembl_gene_id hgnc_id # 1 ENSG00000210049 HGNC:7481 # 2 ENSG00000211459 HGNC:7470 # 3 ENSG00000210077 HGNC:7500 # 4 ENSG00000210082 HGNC:7471 # 5 ENSG00000209082 HGNC:7490 # 6 ENSG00000198888 HGNC:7455 dim(bm_result) # [1] 64029 2 length(unique(bm_result$hgnc_id))
# [1] 36714  # <-- pretty close to 36,713 ;-)


Note that a small number of Ensembl genes seem to be mapped to more than 1 HUGO gene:

length(unique(bm_result$ensembl_gene_id)) # [1] 63967  so the mapping between Ensembl genes and HUGO genes is actually a many-to-many mapping. How did you map your HUGO genes to the Ensembl genes? Could you share the code? Could you share some of your HUGO genes that are not mapped to an Ensembl gene? Thanks, H. ADD REPLYlink written 2.1 years ago by Hervé Pagès ♦♦ 14k Hi Herve, I have made some progress, but there are still missing genes. I am using DESeq2 for DE analysis and a tab delimited file a bioinformatician shared with me containing ensembl gene IDs and HUGO symbols. I have since asked them where the file originated from, as I do find discrepancies; mainly, dots and numbers following ensembl ID. See here: ENSG00000000003.10 TSPAN6 ENSG00000000005.5 TNMD ENSG00000000419.8 DPM1 ENSG00000000457.9 SCYL3 ENSG00000000460.12 C1orf112 ENSG00000000938.8 FGR ENSG00000000971.11 CFH ENSG00000001036.9 FUCA2 ENSG00000001084.6 GCLC ENSG00000001167.10 NFYA ENSG00000001460.13 STPG1 ENSG00000001461.12 NIPAL3 ENSG00000001497.12 LAS1L ENSG00000001561.6 ENPP4 ENSG00000001617.7 SEMA3F It looks like Gencode ID. I gsubbed the dots and numbers away, and I wrote this: mart = useMart("ensembl", dataset="hsapiens_gene_ensembl") results.test=getBM(attributes = c("ensembl_gene_id","external_gene_name", "hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position","gene_biotype"), filters = "ensembl_gene_id", values = df$gene_id, mart = mart)

>sum(results.test$ensembl_gene_id %in% df$gene_id)
[1] 28560
> sum(df$gene_id %in% results.test$ensembl_gene_id)
[1] 28358

>length(unique(results.test$external_gene_name)) [1] 28344 This time I got 28358 matches, out of a total of 29581 genes I got from my DESeq2 analysis, which is a lot better than when I was using HUGO symbols to filter biomart. However, as you can see, there is a weird discrepancy depending on how I run it, which I cannot readily explain. I looked up the genes that did not match and these are the top 20. head(diff, n = 20) [1] "ENSG00000005955" "ENSG00000006074" "ENSG00000006075" "ENSG00000006114" [5] "ENSG00000017373" "ENSG00000017621" "ENSG00000031544" "ENSG00000034063" [9] "ENSG00000049319" "ENSG00000056661" "ENSG00000068793" "ENSG00000073009" [13] "ENSG00000077809" "ENSG00000083842" "ENSG00000087916" "ENSG00000090920" [17] "ENSG00000102069" "ENSG00000102080" "ENSG00000104725" "ENSG00000105663" sum(diff %in% bm_result0) [1] 0 Interestingly, when I look up these ensembl IDs, I do find the genes connected to them, however on Ensembl I find: "This identifier is not in the current EnsEMBL database". I don't know what to make of this. When I use HUGO symbols to filter my mart, things get much worse: results.hugo=getBM(attributes = c("ensembl_gene_id","external_gene_name", "hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position","gene_biotype"), filters = "hgnc_symbol", values = df$gene_name, mart = mart)

I get 22421 genes, out of which 19804 are unique and I find a similar discrepancy depending on how I run %in%, which I think cannot be explained by the different total number of genes in the comparisons.

length(unique(results.hugo$hgnc_symbol)) [1] 19804 sum(results.hugo$hgnc_symbol %in% df$gene_name) [1] 22420 > sum(df$gene_name %in% results.hugo$hgnc_symbol) [1] 19869 Thanks so much for the help! R. ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by spr0 Answer: Biomart ensembl annotation does not completely match my gene names 0 2.1 years ago by Hervé Pagès ♦♦ 14k United States Hervé Pagès ♦♦ 14k wrote: Note that your tab delimited file seems to contain a bunch of gene IDs that don't belong to the current Gencode release (Release 27) for Human. You can easily check this with: 1) Download the GFF3 file containing the "Comprehensive gene annotation" for ALL regions from 2) Then: library(GenomicFeatures) txdb <- makeTxDbFromGFF("gencode.v27.chr_patch_hapl_scaff.annotation.gff3.gz") tx_by_gene <- transcriptsBy(txdb, by="gene") table(df$gene_id %in% names(tx_by_gene))
# FALSE  TRUE
#    13     2 

So 13 out of the 15 genes in your above list are not valid Human Gencode IDs! Could it be that they are from another organism and/or from an old (and obsolete) version of Gencode?

This questions the accuracy/correctness of this file. Knowing more about how/when the file was generated and how the Gencode IDs in it were mapped to HUGO symbols might help shed some light on your question.

H.