Search
Question: Biomart ensembl annotation does not completely match my gene names
0
gravatar for spromanos
8 weeks ago by
spromanos0
USA/Boston/Dana-Farber Cancer Institute, Broad Institute
spromanos0 wrote:

I am using getBM to get start/end positions for my genes, however biomart ensembl hgnz_names only contain 19869 out of my 29581 genes. In my dataset, gene names are HUGO, too. It seems that around 10,000 of my genes are not included in the total of 36,713 genes in biomart ensembl. I checked to see if there is a potential naming difference, but I couldn't find anything incriminating. Does anyone have an idea why that happens? Thanks!

ADD COMMENTlink modified 8 weeks ago by Hervé Pagès ♦♦ 13k • written 8 weeks ago by spromanos0

Hi,

A few things that might help understand what's going on:

What's "biomart ensembl hgnz_names"? I see attributes hgnc_idhgnc_symbol, and hgnc_trans_name in the hsapiens_gene_ensembl dataset of the ENSEMBL_MART_ENSEMBL mart. Is this what you are talking about?

Where did you get your 29581 HUGO genes from?

How do you count 36,713 genes in "biomart ensembl". I see 63,967 genes there:

library(biomaRt)
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
                dataset="hsapiens_gene_ensembl")
bm_result0 <- getBM("ensembl_gene_id", mart=mart)
head(bm_result0)
#   ensembl_gene_id
# 1 ENSG00000000003
# 2 ENSG00000000005
# 3 ENSG00000000419
# 4 ENSG00000000457
# 5 ENSG00000000460
# 6 ENSG00000000938
length(unique(bm_result0$ensembl_gene_id))
# [1] 63967

Maybe you meant that only 36,713 HUGO genes are actually mapped to the Ensembl genes?

bm_result <- getBM(c("ensembl_gene_id", "hgnc_id"), mart=mart)
head(bm_result)
#   ensembl_gene_id   hgnc_id
# 1 ENSG00000210049 HGNC:7481
# 2 ENSG00000211459 HGNC:7470
# 3 ENSG00000210077 HGNC:7500
# 4 ENSG00000210082 HGNC:7471
# 5 ENSG00000209082 HGNC:7490
# 6 ENSG00000198888 HGNC:7455
dim(bm_result)
# [1] 64029     2
length(unique(bm_result$hgnc_id))
# [1] 36714  # <-- pretty close to 36,713 ;-)

Note that a small number of Ensembl genes seem to be mapped to more than 1 HUGO gene:

length(unique(bm_result$ensembl_gene_id))
# [1] 63967

so the mapping between Ensembl genes and HUGO genes is actually a many-to-many mapping.

How did you map your HUGO genes to the Ensembl genes? Could you share the code?

Could you share some of your HUGO genes that are not mapped to an Ensembl gene?

Thanks,

H.

ADD REPLYlink written 8 weeks ago by Hervé Pagès ♦♦ 13k

Hi Herve, 

I have made some progress, but there are still missing genes.

I am using DESeq2 for DE analysis and a tab delimited file a bioinformatician shared with me containing ensembl gene IDs and HUGO symbols. I have since asked them where the file originated from, as I do find discrepancies; mainly, dots and numbers following ensembl ID. See here: 

ENSG00000000003.10 TSPAN6

ENSG00000000005.5 TNMD

ENSG00000000419.8 DPM1

ENSG00000000457.9 SCYL3

ENSG00000000460.12 C1orf112

ENSG00000000938.8 FGR

ENSG00000000971.11 CFH

ENSG00000001036.9 FUCA2

ENSG00000001084.6 GCLC

ENSG00000001167.10 NFYA

ENSG00000001460.13 STPG1

ENSG00000001461.12 NIPAL3

ENSG00000001497.12 LAS1L

ENSG00000001561.6 ENPP4

ENSG00000001617.7 SEMA3F

It looks like Gencode ID. I gsubbed the dots and numbers away, and I wrote this:

mart = useMart("ensembl", dataset="hsapiens_gene_ensembl") 

results.test=getBM(attributes = c("ensembl_gene_id","external_gene_name", "hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position","gene_biotype"), filters = "ensembl_gene_id", values = df$gene_id, mart = mart)

>sum(results.test$ensembl_gene_id %in% df$gene_id)
[1] 28560
> sum(df$gene_id %in% results.test$ensembl_gene_id)
[1] 28358

>length(unique(results.test$external_gene_name))

[1] 28344

This time I got 28358 matches, out of a total of 29581 genes I got from my DESeq2 analysis, which is a lot better than when I was using HUGO symbols to filter biomart. However, as you can see, there is a weird discrepancy depending on how I run it, which I cannot readily explain. I looked up the genes that did not match and these are the top 20.

head(diff, n = 20)
 [1] "ENSG00000005955" "ENSG00000006074" "ENSG00000006075" "ENSG00000006114"
 [5] "ENSG00000017373" "ENSG00000017621" "ENSG00000031544" "ENSG00000034063"
 [9] "ENSG00000049319" "ENSG00000056661" "ENSG00000068793" "ENSG00000073009"
[13] "ENSG00000077809" "ENSG00000083842" "ENSG00000087916" "ENSG00000090920"
[17] "ENSG00000102069" "ENSG00000102080" "ENSG00000104725" "ENSG00000105663"

 sum(diff %in% bm_result0)
[1] 0

Interestingly, when I look up these ensembl IDs, I do find the genes connected to them, however on Ensembl I find: "This identifier is not in the current EnsEMBL database". I don't know what to make of this. 

When I use HUGO symbols to filter my mart, things get much worse:

results.hugo=getBM(attributes = c("ensembl_gene_id","external_gene_name", "hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position","gene_biotype"), filters = "hgnc_symbol", values = df$gene_name, mart = mart)

I get 22421 genes, out of which 19804 are unique and I find a similar discrepancy depending on how I run %in%, which I think cannot be explained by the different total number of genes in the comparisons.  

length(unique(results.hugo$hgnc_symbol))
[1] 19804

sum(results.hugo$hgnc_symbol %in% df$gene_name)
[1] 22420
> sum(df$gene_name %in% results.hugo$hgnc_symbol)
[1] 19869
 

Thanks so much for the help!

R.

 

 

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by spromanos0
0
gravatar for Hervé Pagès
8 weeks ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Note that your tab delimited file seems to contain a bunch of gene IDs that don't belong to the current Gencode release (Release 27) for Human. You can easily check this with:

1) Download the GFF3 file containing the "Comprehensive gene annotation" for ALL regions from

    https://www.gencodegenes.org/releases/current.html

2) Then:

library(GenomicFeatures)
txdb <- makeTxDbFromGFF("gencode.v27.chr_patch_hapl_scaff.annotation.gff3.gz")
tx_by_gene <- transcriptsBy(txdb, by="gene")
table(df$gene_id %in% names(tx_by_gene))
# FALSE  TRUE 
#    13     2 

So 13 out of the 15 genes in your above list are not valid Human Gencode IDs! Could it be that they are from another organism and/or from an old (and obsolete) version of Gencode?

This questions the accuracy/correctness of this file. Knowing more about how/when the file was generated and how the Gencode IDs in it were mapped to HUGO symbols might help shed some light on your question.

H.

ADD COMMENTlink written 8 weeks ago by Hervé Pagès ♦♦ 13k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 305 users visited in the last hour