Question: Ensembl transcripts returned by getBM not matching the biomart website
21 months ago by
hihi.joshi0 wrote:

Hi,

I am writing a workflow to extract Ensembl transcripts using the getBM facility.

This is a portion of the code.

library(biomaRt)
library(AnnotationHub)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
rm(list=ls())

ensembl_mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org")
organism <- useDataset(dataset = "hsapiens_gene_ensembl", mart = ensembl_mart)
txdb <- makeTxDbFromBiomart(dataset = "hsapiens_gene_ensembl")
trs <- transcripts(txdb)

genes <- getBM(c("chromosome_name", "ensembl_gene_id", "ensembl_transcript_id",
                 "transcript_start", "transcript_end", "hgnc_symbol"),
               filters = "biotype",
               values = c("protein_coding"),
               mart = organism)

genes[genes$ensembl_transcript_id == "ENST00000226218", ]

The results are as shown below:

  chromosome_name ensembl_gene_id ensembl_transcript_id transcript_start transcript_end hgnc_symbol
1              17 ENSG00000109072       ENST00000226218         26694297       26697843         VTN
2              17 ENSG00000109072       ENST00000226218         26694297       26697843       SEBOX

Now, if you run the same query on the GRCh37 Ensembl BioMart website (I've provided the URL here ... You just need to click "Results"), the results are as shown below:

Gene stable ID     Transcript stable ID    Transcript start (bp)    Transcript end (bp)    Gene name
ENSG00000109072    ENST00000226218         26694297                 26697843               VTN

This leads me to think that either getBM is doing something really odd or I'm making a mistake somewhere.

I would appreciate it if someone could shed some light on this.

Thanks,

Himanshu

modified 21 months ago • written 21 months ago by hihi.joshi0
Answer: Ensembl transcripts returned by getBM not matching the biomart website
21 months ago by
hihi.joshi0 wrote:

Okay ... I just realised what I was doing incorrectly.

Gene name and HGNC symbol are two distinct fields in the BioMart dataset. I was retrieving hgnc_symbol via getBM and gene_name via the BioMart web interface.
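For anyone hitting the same thing, here is a rough sketch of the query with the symbol column made explicit, so the R call can be lined up with whatever the website shows. Two assumptions on my part: the wrapper name fetch_protein_coding is made up for illustration (it is not part of biomaRt), and I am assuming the website's "Gene name" column corresponds to the attribute external_gene_name. It is worth confirming the attribute name with listAttributes() on your own mart before relying on it.

```r
# Hypothetical wrapper (not part of biomaRt): the same query as in the question,
# with the symbol attribute passed in as an argument.
# Assumption: the website's "Gene name" column is the attribute "external_gene_name".
fetch_protein_coding <- function(mart, symbol_attr = "external_gene_name") {
  biomaRt::getBM(
    attributes = c("chromosome_name", "ensembl_gene_id", "ensembl_transcript_id",
                   "transcript_start", "transcript_end", symbol_attr),
    filters = "biotype",
    values  = "protein_coding",
    mart    = mart
  )
}

# Usage (needs the biomaRt package and network access):
# genes <- fetch_protein_coding(organism)
# genes[genes$ensembl_transcript_id == "ENST00000226218", ]
```

Passing symbol_attr = "hgnc_symbol" instead should reproduce the original query from the question.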

• Once I changed the getBM call so that exactly the same fields are retrieved, both results match.
• I don't know why hgnc_symbol and gene_name differ for the same gene, but I'm guessing it has something to do with which gene names were originally designated and which were subsequently corrected through chromosome patches (e.g. HG883_PATCH ... This one incidentally updates the locations of two genes on Chr 17, VTN and SEBOX. Both of these genes were previously annotated as completely overlapping).
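A quick way to spot this kind of one-to-many mapping in a getBM result is to list the transcripts that carry more than one value in a given column. The sketch below runs offline on the two rows from the question; the helper name multi_valued is my own, not a biomaRt function.

```r
# The two rows returned by getBM in the question, rebuilt by hand.
genes <- data.frame(
  chromosome_name       = c("17", "17"),
  ensembl_gene_id       = c("ENSG00000109072", "ENSG00000109072"),
  ensembl_transcript_id = c("ENST00000226218", "ENST00000226218"),
  transcript_start      = c(26694297L, 26694297L),
  transcript_end        = c(26697843L, 26697843L),
  hgnc_symbol           = c("VTN", "SEBOX"),
  stringsAsFactors      = FALSE
)

# Hypothetical helper: for each id, collect the distinct values of `column`
# and keep only the ids that have more than one.
multi_valued <- function(df, id = "ensembl_transcript_id", column = "hgnc_symbol") {
  values_per_id <- lapply(split(df[[column]], df[[id]]), unique)
  Filter(function(v) length(v) > 1, values_per_id)
}

dups <- multi_valued(genes)
print(dups)
```

Any transcript that shows up in dups will appear as several rows in the getBM output, one per distinct value, which is what made the R result look different from the website.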