Question

Biomart ensembl annotation does not completely match my gene names

0

Entering edit mode

spr ▴ 10

@spr-13965

Last seen 5.9 years ago

I am using getBM to get start/end positions for my genes, however biomart ensembl hgnz_names only contain 19869 out of my 29581 genes. In my dataset, gene names are HUGO, too. It seems that around 10,000 of my genes are not included in the total of 36,713 genes in biomart ensembl. I checked to see if there is a potential naming difference, but I couldn't find anything incriminating. Does anyone have an idea why that happens? Thanks!

biomart ensemblbiomart • 4.8k views

ADD COMMENT • link updated 8.4 years ago by Hervé Pagès 16k • written 8.4 years ago by spr ▴ 10

0

Entering edit mode

Hi,

A few things that might help understand what's going on:

What's "biomart ensembl hgnz_names"? I see attributes hgnc_id, hgnc_symbol, and hgnc_trans_name in the hsapiens_gene_ensembl dataset of the ENSEMBL_MART_ENSEMBL mart. Is this what you are talking about?

Where did you get your 29581 HUGO genes from?

How do you count 36,713 genes in "biomart ensembl". I see 63,967 genes there:

library(biomaRt)
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
                dataset="hsapiens_gene_ensembl")
bm_result0 <- getBM("ensembl_gene_id", mart=mart)
head(bm_result0)
#   ensembl_gene_id
# 1 ENSG00000000003
# 2 ENSG00000000005
# 3 ENSG00000000419
# 4 ENSG00000000457
# 5 ENSG00000000460
# 6 ENSG00000000938
length(unique(bm_result0$ensembl_gene_id))
# [1] 63967

Maybe you meant that only 36,713 HUGO genes are actually mapped to the Ensembl genes?

bm_result <- getBM(c("ensembl_gene_id", "hgnc_id"), mart=mart)
head(bm_result)
#   ensembl_gene_id   hgnc_id
# 1 ENSG00000210049 HGNC:7481
# 2 ENSG00000211459 HGNC:7470
# 3 ENSG00000210077 HGNC:7500
# 4 ENSG00000210082 HGNC:7471
# 5 ENSG00000209082 HGNC:7490
# 6 ENSG00000198888 HGNC:7455
dim(bm_result)
# [1] 64029     2
length(unique(bm_result$hgnc_id))
# [1] 36714  # <-- pretty close to 36,713 ;-)

Note that a small number of Ensembl genes seem to be mapped to more than 1 HUGO gene:

length(unique(bm_result$ensembl_gene_id))
# [1] 63967

so the mapping between Ensembl genes and HUGO genes is actually a many-to-many mapping.

How did you map your HUGO genes to the Ensembl genes? Could you share the code?

Could you share some of your HUGO genes that are not mapped to an Ensembl gene?

Thanks,

H.

ADD REPLY • link 8.4 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Herve,

I have made some progress, but there are still missing genes.

I am using DESeq2 for DE analysis and a tab delimited file a bioinformatician shared with me containing ensembl gene IDs and HUGO symbols. I have since asked them where the file originated from, as I do find discrepancies; mainly, dots and numbers following ensembl ID. See here:

ENSG00000000003.10 TSPAN6

ENSG00000000005.5 TNMD

ENSG00000000419.8 DPM1

ENSG00000000457.9 SCYL3

ENSG00000000460.12 C1orf112

ENSG00000000938.8 FGR

ENSG00000000971.11 CFH

ENSG00000001036.9 FUCA2

ENSG00000001084.6 GCLC

ENSG00000001167.10 NFYA

ENSG00000001460.13 STPG1

ENSG00000001461.12 NIPAL3

ENSG00000001497.12 LAS1L

ENSG00000001561.6 ENPP4

ENSG00000001617.7 SEMA3F

It looks like Gencode ID. I gsubbed the dots and numbers away, and I wrote this:

mart = useMart("ensembl", dataset="hsapiens_gene_ensembl")

results.test=getBM(attributes = c("ensembl_gene_id","external_gene_name", "hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position","gene_biotype"), filters = "ensembl_gene_id", values = df$gene_id, mart = mart)

>sum(results.test$ensembl_gene_id %in% df$gene_id)
[1] 28560
> sum(df$gene_id %in% results.test$ensembl_gene_id)
[1] 28358

>length(unique(results.test$external_gene_name))

[1] 28344

This time I got 28358 matches, out of a total of 29581 genes I got from my DESeq2 analysis, which is a lot better than when I was using HUGO symbols to filter biomart. However, as you can see, there is a weird discrepancy depending on how I run it, which I cannot readily explain. I looked up the genes that did not match and these are the top 20.

head(diff, n = 20)
[1] "ENSG00000005955" "ENSG00000006074" "ENSG00000006075" "ENSG00000006114"
[5] "ENSG00000017373" "ENSG00000017621" "ENSG00000031544" "ENSG00000034063"
[9] "ENSG00000049319" "ENSG00000056661" "ENSG00000068793" "ENSG00000073009"
[13] "ENSG00000077809" "ENSG00000083842" "ENSG00000087916" "ENSG00000090920"
[17] "ENSG00000102069" "ENSG00000102080" "ENSG00000104725" "ENSG00000105663"

sum(diff %in% bm_result0)
[1] 0

Interestingly, when I look up these ensembl IDs, I do find the genes connected to them, however on Ensembl I find: "This identifier is not in the current EnsEMBL database". I don't know what to make of this.

When I use HUGO symbols to filter my mart, things get much worse:

results.hugo=getBM(attributes = c("ensembl_gene_id","external_gene_name", "hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position","gene_biotype"), filters = "hgnc_symbol", values = df$gene_name, mart = mart)

I get 22421 genes, out of which 19804 are unique and I find a similar discrepancy depending on how I run %in%, which I think cannot be explained by the different total number of genes in the comparisons.

length(unique(results.hugo$hgnc_symbol))
[1] 19804

sum(results.hugo$hgnc_symbol %in% df$gene_name)
[1] 22420
> sum(df$gene_name %in% results.hugo$hgnc_symbol)
[1] 19869

Thanks so much for the help!

R.

ADD REPLY • link 8.4 years ago spr ▴ 10

score 0 · Answer 1 · 2017-09-25

Note that your tab delimited file seems to contain a bunch of gene IDs that don't belong to the current Gencode release (Release 27) for Human. You can easily check this with:

1) Download the GFF3 file containing the "Comprehensive gene annotation" for ALL regions from

https://www.gencodegenes.org/releases/current.html

2) Then:

library(GenomicFeatures)
txdb <- makeTxDbFromGFF("gencode.v27.chr_patch_hapl_scaff.annotation.gff3.gz")
tx_by_gene <- transcriptsBy(txdb, by="gene")
table(df$gene_id %in% names(tx_by_gene))
# FALSE  TRUE 
#    13     2

So 13 out of the 15 genes in your above list are not valid Human Gencode IDs! Could it be that they are from another organism and/or from an old (and obsolete) version of Gencode?

This questions the accuracy/correctness of this file. Knowing more about how/when the file was generated and how the Gencode IDs in it were mapped to HUGO symbols might help shed some light on your question.

H.