Ensembl transcripts returned by getBM not matching the biomart website
hihi.joshi • 0
@hihijoshi-14881
Last seen 6.8 years ago

Hi, 

I am writing a workflow to extract Ensembl transcripts using the getBM facility.

This is a portion of the code.

library(biomaRt)
library(AnnotationHub)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
rm(list=ls())

ensembl_mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org")
organism <- useDataset(dataset = "hsapiens_gene_ensembl", mart = ensembl_mart)
txdb <- makeTxDbFromBiomart(dataset = "hsapiens_gene_ensembl")
trs <- transcripts(txdb)

genes <- getBM(c("chromosome_name","ensembl_gene_id","ensembl_transcript_id","transcript_start","transcript_end","hgnc_symbol"), 
               filters = "biotype",
               values = c("protein_coding"), 
               mart = organism)

genes[genes$ensembl_transcript_id=="ENST00000226218",]

The results are as shown below:

  chromosome_name ensembl_gene_id ensembl_transcript_id transcript_start transcript_end hgnc_symbol
1              17 ENSG00000109072       ENST00000226218         26694297       26697843         VTN
2              17 ENSG00000109072       ENST00000226218         26694297       26697843       SEBOX

Now, if you run the same query on the GRCh37 Ensembl BioMart website (I've provided the URL here ... You just need to click "Results"), the results are as shown below:

Gene stable ID    Transcript stable ID    Transcript start (bp)    Transcript end (bp)    Gene name
ENSG00000109072    ENST00000226218    26694297    26697843    VTN

This leads me to think that either getBM is doing something really odd or I'm making a mistake somewhere.

I would appreciate it if someone could shed some light on this.

Thanks,

Himanshu

ensembl transcripts getbm biomart • 1.3k views
hihi.joshi • 0

Okay ... I just realised what I was doing incorrectly.

Gene name and HGNC symbol are two distinct fields in the BioMart dataset. I was retrieving hgnc_symbol via getBM but Gene name via the BioMart web interface.

  • Once I fixed the getBM call so that exactly the same fields are retrieved, both results match.
  • I don't know why hgnc_symbol and gene_name differ for the same gene, but I'm guessing it has something to do with which gene names were originally designated and which were subsequently corrected through chromosome patches (e.g. HG883_PATCH ... This one incidentally updates the locations of two genes on Chr 17, VTN & SEBOX. Both of these genes were previously annotated as completely overlapping).
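
For anyone landing here with the same symptom, here is a rough sketch of the fix described above. It rests on a few assumptions that are not in the original post: that "external_gene_name" is the attribute behind the website's "Gene name" column (worth confirming with listAttributes()), and that useEnsembl(..., GRCh = 37) reaches the same GRCh37 archive as the original host argument. The names fetch_transcript, multi_row_transcripts and posted are made up for illustration.

```r
# Query one transcript, asking for the gene name rather than hgnc_symbol.
# Assumption: "external_gene_name" backs the website's "Gene name" column;
# check biomaRt::listAttributes(mart) before relying on it.
fetch_transcript <- function(transcript_id) {
  # Assumption: GRCh = 37 selects the same GRCh37 archive as
  # host = "grch37.ensembl.org" in the original call.
  mart <- biomaRt::useEnsembl(biomart = "ensembl",
                              dataset = "hsapiens_gene_ensembl",
                              GRCh = 37)
  biomaRt::getBM(attributes = c("chromosome_name", "ensembl_gene_id",
                                "ensembl_transcript_id", "transcript_start",
                                "transcript_end", "external_gene_name"),
                 filters = "ensembl_transcript_id",
                 values = transcript_id,
                 mart = mart)
}

# Hypothetical helper: transcript IDs that occupy more than one row.
multi_row_transcripts <- function(df) {
  ids <- df$ensembl_transcript_id
  unique(ids[duplicated(ids)])
}

# The two rows from the getBM() output in the question, typed in by hand.
posted <- data.frame(
  ensembl_transcript_id = c("ENST00000226218", "ENST00000226218"),
  hgnc_symbol = c("VTN", "SEBOX"),
  stringsAsFactors = FALSE
)
multi_row_transcripts(posted)

# Needs network access, so left commented out here:
# genes <- fetch_transcript("ENST00000226218")
# multi_row_transcripts(genes)
```

Running multi_row_transcripts() on the full getBM result is a quick way to spot transcripts that pick up more than one value for an attribute, which is what made the row count differ from the website here.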