Question: Ensembl transcripts returned by getBM not matching the biomart website
10 months ago, hihi.joshi wrote:


I am writing a workflow to extract Ensembl transcripts using the getBM function from the biomaRt package.

This is a portion of the code.


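library(biomaRt)          # useMart, useDataset, getBM
library(GenomicFeatures)  # transcripts, makeTxDbFromBiomart (in txdbmaker in recent Bioconductor)

# Connect to the Ensembl genes mart and select the human dataset
ensembl_mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="")
organism <- useDataset(dataset = "hsapiens_gene_ensembl", mart = ensembl_mart)
txdb <- makeTxDbFromBiomart(dataset = "hsapiens_gene_ensembl")
trs <- transcripts(txdb)

# Retrieve protein-coding transcripts with their coordinates and HGNC symbols
genes <- getBM(attributes = c("chromosome_name","ensembl_gene_id","ensembl_transcript_id","transcript_start","transcript_end","hgnc_symbol"),
               filters = "biotype",
               values = c("protein_coding"),
               mart = organism)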


The results are as shown below:

  chromosome_name ensembl_gene_id ensembl_transcript_id transcript_start transcript_end hgnc_symbol
1              17 ENSG00000109072       ENST00000226218         26694297       26697843         VTN
2              17 ENSG00000109072       ENST00000226218         26694297       26697843       SEBOX

Now, if you run the same query on the GRCh37 Ensembl BioMart website (I've provided the URL here ... you just need to click "Results"), the results are as shown below:

Gene stable ID    Transcript stable ID    Transcript start (bp)    Transcript end (bp)    Gene name
ENSG00000109072    ENST00000226218    26694297    26697843    VTN

This leads me to think that either getBM is doing something really odd or I'm making a mistake somewhere.

I would appreciate it if someone could shed some light on this.



10 months ago, hihi.joshi wrote:

Okay ... I just realised what I was doing incorrectly.

Gene name and HGNC symbol are two distinct fields in the BioMart dataset. I was retrieving hgnc_symbol via getBM but gene_name via the BioMart web interface.

  • Once I fixed the getBM call so that exactly the same fields are retrieved, both results match.
  • I don't know why hgnc_symbol and gene_name differ (for the same gene), but I'm guessing it has something to do with which gene names were originally designated and which were subsequently corrected through chromosome patches (e.g. HG883_PATCH ... This one incidentally updates the locations for two of the genes on Chr 17, VTN & SEBOX. Both of these genes were previously annotated as being completely overlapping).
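
For anyone hitting the same thing, here is a minimal sketch of how the extra row can be spotted without querying BioMart again. It uses only base R and the two rows quoted in the question; the variable names symbols_per_tx, multi_symbol_tx and tx_only are made up for illustration.

```r
# Rows copied from the getBM output quoted in the question
genes <- data.frame(
  chromosome_name       = c("17", "17"),
  ensembl_gene_id       = c("ENSG00000109072", "ENSG00000109072"),
  ensembl_transcript_id = c("ENST00000226218", "ENST00000226218"),
  transcript_start      = c(26694297, 26694297),
  transcript_end        = c(26697843, 26697843),
  hgnc_symbol           = c("VTN", "SEBOX"),
  stringsAsFactors      = FALSE
)

# Count the distinct HGNC symbols attached to each transcript
symbols_per_tx <- tapply(genes$hgnc_symbol, genes$ensembl_transcript_id,
                         function(x) length(unique(x)))

# Transcripts that occupy more than one row only because of the symbol column
multi_symbol_tx <- names(symbols_per_tx)[symbols_per_tx > 1]
print(multi_symbol_tx)

# Dropping the symbol column and de-duplicating leaves one row per transcript
tx_only <- unique(genes[, setdiff(names(genes), "hgnc_symbol")])
print(nrow(tx_only))
```

To make the getBM query itself match the website, the attribute behind the "Gene name" column has to replace hgnc_symbol. As far as I know that attribute is called external_gene_name in recent Ensembl releases, but that is an assumption for any particular mart; listAttributes(organism) should show the exact name.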