I am using getBM to get start/end positions for my genes, however biomart ensembl hgnz_names only contain 19869 out of my 29581 genes. In my dataset, gene names are HUGO, too. It seems that around 10,000 of my genes are not included in the total of 36,713 genes in biomart ensembl. I checked to see if there is a potential naming difference, but I couldn't find anything incriminating. Does anyone have an idea why that happens? Thanks!
Hi,
A few things that might help understand what's going on:
What's "biomart ensembl hgnz_names"? I see attributes
hgnc_id
,hgnc_symbol
, andhgnc_trans_name
in thehsapiens_gene_ensembl
dataset of theENSEMBL_MART_ENSEMBL
mart. Is this what you are talking about?Where did you get your 29581 HUGO genes from?
How do you count 36,713 genes in "biomart ensembl". I see 63,967 genes there:
Maybe you meant that only 36,713 HUGO genes are actually mapped to the Ensembl genes?
Note that a small number of Ensembl genes seem to be mapped to more than 1 HUGO gene:
so the mapping between Ensembl genes and HUGO genes is actually a many-to-many mapping.
How did you map your HUGO genes to the Ensembl genes? Could you share the code?
Could you share some of your HUGO genes that are not mapped to an Ensembl gene?
Thanks,
H.
Hi Herve,
I have made some progress, but there are still missing genes.
I am using DESeq2 for DE analysis and a tab delimited file a bioinformatician shared with me containing ensembl gene IDs and HUGO symbols. I have since asked them where the file originated from, as I do find discrepancies; mainly, dots and numbers following ensembl ID. See here:
ENSG00000000003.10 TSPAN6
ENSG00000000005.5 TNMD
ENSG00000000419.8 DPM1
ENSG00000000457.9 SCYL3
ENSG00000000460.12 C1orf112
ENSG00000000938.8 FGR
ENSG00000000971.11 CFH
ENSG00000001036.9 FUCA2
ENSG00000001084.6 GCLC
ENSG00000001167.10 NFYA
ENSG00000001460.13 STPG1
ENSG00000001461.12 NIPAL3
ENSG00000001497.12 LAS1L
ENSG00000001561.6 ENPP4
ENSG00000001617.7 SEMA3F
It looks like Gencode ID. I gsubbed the dots and numbers away, and I wrote this:
mart = useMart("ensembl", dataset="hsapiens_gene_ensembl")
results.test=getBM(attributes = c("ensembl_gene_id","external_gene_name", "hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position","gene_biotype"), filters = "ensembl_gene_id", values = df$gene_id, mart = mart)
>sum(results.test$ensembl_gene_id %in% df$gene_id)
[1] 28560
> sum(df$gene_id %in% results.test$ensembl_gene_id)
[1] 28358
>length(unique(results.test$external_gene_name))
[1] 28344
This time I got 28358 matches, out of a total of 29581 genes I got from my DESeq2 analysis, which is a lot better than when I was using HUGO symbols to filter biomart. However, as you can see, there is a weird discrepancy depending on how I run it, which I cannot readily explain. I looked up the genes that did not match and these are the top 20.
head(diff, n = 20)
[1] "ENSG00000005955" "ENSG00000006074" "ENSG00000006075" "ENSG00000006114"
[5] "ENSG00000017373" "ENSG00000017621" "ENSG00000031544" "ENSG00000034063"
[9] "ENSG00000049319" "ENSG00000056661" "ENSG00000068793" "ENSG00000073009"
[13] "ENSG00000077809" "ENSG00000083842" "ENSG00000087916" "ENSG00000090920"
[17] "ENSG00000102069" "ENSG00000102080" "ENSG00000104725" "ENSG00000105663"
sum(diff %in% bm_result0)
[1] 0
Interestingly, when I look up these ensembl IDs, I do find the genes connected to them, however on Ensembl I find: "This identifier is not in the current EnsEMBL database". I don't know what to make of this.
When I use HUGO symbols to filter my mart, things get much worse:
results.hugo=getBM(attributes = c("ensembl_gene_id","external_gene_name", "hgnc_symbol","entrezgene", "chromosome_name", "start_position", "end_position","gene_biotype"), filters = "hgnc_symbol", values = df$gene_name, mart = mart)
I get 22421 genes, out of which 19804 are unique and I find a similar discrepancy depending on how I run %in%, which I think cannot be explained by the different total number of genes in the comparisons.
length(unique(results.hugo$hgnc_symbol))
[1] 19804
sum(results.hugo$hgnc_symbol %in% df$gene_name)
[1] 22420
> sum(df$gene_name %in% results.hugo$hgnc_symbol)
[1] 19869
Thanks so much for the help!
R.