RefSeq coordinates from biomaRt
Dave Tang ▴ 210
@dave-tang-4661
Australia/Perth/UWA
Hello,

I've been using biomaRt to fetch the genomic coordinates of RefSeq transcripts (perhaps in an incorrect manner?). I found that the RefSeq coordinates don't match the coordinates provided by the UCSC Genome Browser (NM_033453 at chr20:3190006-3204516):

    library("biomaRt")
    ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
    getBM(attributes=c('refseq_mrna','chromosome_name','start_position','end_position','strand'),
          filters = 'refseq_mrna', values = 'NM_033453', mart = ensembl)

      refseq_mrna chromosome_name start_position end_position strand
    1   NM_033453              20        3189514      3204516      1

The coordinates seem to match this Ensembl transcript (ENST00000483354) instead:

    getBM(attributes=c('ensembl_transcript_id','chromosome_name','start_position','end_position','strand'),
          filters = 'ensembl_transcript_id', values = 'ENST00000483354', mart = ensembl)

      ensembl_transcript_id chromosome_name start_position end_position strand
    1       ENST00000483354              20        3189514      3204516      1

Here's another RefSeq model, NM_181493, which should be mapped to chr20:3190134-3204516:

    getBM(attributes=c('refseq_mrna','chromosome_name','start_position','end_position','strand'),
          filters = 'refseq_mrna', values = 'NM_181493', mart = ensembl)

      refseq_mrna chromosome_name start_position end_position strand
    1   NM_181493              20        3189514      3204516      1

So it seems the RefSeq IDs are mapped to the longest Ensembl transcript model that covers the RefSeq model. I searched around the web and looked at the different available marts, but nothing obvious popped out. How should I go about obtaining RefSeq coordinates using biomaRt? Or is biomaRt Ensembl-centric?
    sessionInfo()
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
    [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
    [5] LC_TIME=English_Australia.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] biomaRt_2.16.0

    loaded via a namespace (and not attached):
    [1] RCurl_1.95-4.1 tools_3.0.2    XML_3.98-1.1

Cheers,
-- Dave
@sean-davis-490
United States
On Mon, Nov 25, 2013 at 3:47 AM, Dave Tang wrote:
> [...]

Hi, Dave.

There may be several issues going on here, so you'll have to do some digging yourself when discrepancies like this arise. Working through your first example, keep in mind that neither Ensembl nor UCSC is the actual curator of the RefSeq transcripts; NCBI is the source of that annotation. So, if you go to NCBI Gene, search for NM_033453, and play a bit with the Genomic Sequence Viewer, you'll note that the gene (protein NP_258412.1) is mapped with the coordinates given at UCSC, while the mRNA is mapped with the coordinates given by Ensembl. Add to this the complication that UCSC does its own mapping of the transcripts (even RefSeq), and you could even have a "unique" set of coordinates given by UCSC (i.e., not the same as NCBI or Ensembl).

Sean
On Mon, 25 Nov 2013 19:31:22 +0900, Sean Davis wrote:
> [...]

Hi Sean,

Thank you for the prompt reply.

My aim is to have a set of transcript annotations as opposed to gene annotations; I don't really mind whether they are RefSeq or Ensembl transcript models. But I keep running into the same problem: the coordinates returned for either Ensembl or RefSeq transcripts are those of the Ensembl gene that encompasses all the transcripts, i.e. the longest Ensembl gene.
Here's another example:

    library("biomaRt")
    ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

    # ENST00000398344 is on chr22:24,313,554-24,316,773
    getBM(attributes = c('chromosome_name',
                         'start_position',
                         'end_position',
                         'strand'),
          filters = 'ensembl_transcript_id',
          values = 'ENST00000398344',
          mart = ensembl)

      chromosome_name start_position end_position strand
    1              22       24313554     24322660     -1

    # ENST00000430101 is on chr22:24,315,293-24,316,648
    getBM(attributes = c('chromosome_name',
                         'start_position',
                         'end_position',
                         'strand'),
          filters = 'ensembl_transcript_id',
          values = 'ENST00000430101',
          mart = ensembl)

      chromosome_name start_position end_position strand
    1              22       24313554     24322660     -1

Is it possible to obtain the genomic coordinates of Ensembl transcripts via biomaRt?

    sessionInfo()
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
    [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
    [5] LC_TIME=English_Australia.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] biomaRt_2.18.0

    loaded via a namespace (and not attached):
    [1] RCurl_1.95-4.1 tools_3.0.2    XML_3.98-1.1

Cheers,
-- Dave
On Mon, Nov 25, 2013 at 7:39 AM, Dave Tang wrote:
> [...]
> Is it possible to obtain genomic coordinates of Ensembl transcript via
> biomaRt?

Hi, Dave.

You'll want to use "transcript_start" and "transcript_end" rather than "start_position" and "end_position". The latter two describe the gene, which is why every transcript of the same gene came back with identical coordinates.

Sean
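For readers landing here later, the suggestion above might look roughly like the sketch below. fetch_transcript_coords() is a hypothetical helper (it is not part of biomaRt), and the example assumes the 'transcript_start', 'transcript_end', 'refseq_mrna' and 'ensembl_transcript_id' attributes are available in the mart being queried. When the query is filtered on 'refseq_mrna', the coordinates returned are still those of the Ensembl transcript cross-referenced to that RefSeq ID, not NCBI's or UCSC's own alignment, and the numbers depend on the Ensembl release and genome assembly in use.

```r
# Sketch only: needs the biomaRt package and a network connection to Ensembl.
# fetch_transcript_coords() is a hypothetical helper, not a biomaRt function.
fetch_transcript_coords <- function(ids, mart,
                                    id_type = "ensembl_transcript_id") {
  # transcript_start/transcript_end are per-transcript coordinates, whereas
  # start_position/end_position describe the parent gene.
  attrs <- unique(c(id_type, "ensembl_transcript_id", "chromosome_name",
                    "transcript_start", "transcript_end", "strand"))
  biomaRt::getBM(attributes = attrs,
                 filters = id_type,
                 values = ids,
                 mart = mart)
}

# Only query Ensembl when run interactively, so sourcing this file is offline.
if (interactive()) {
  ensembl <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")

  # The two Ensembl transcripts from the question above
  print(fetch_transcript_coords(c("ENST00000398344", "ENST00000430101"),
                                ensembl))

  # The RefSeq ID from the original post; one row per matching Ensembl transcript
  print(fetch_transcript_coords("NM_033453", ensembl,
                                id_type = "refseq_mrna"))
}
```

If a RefSeq ID is cross-referenced to more than one Ensembl transcript, the second call returns several rows, so it is worth checking the 'ensembl_transcript_id' column before treating any one row as "the" RefSeq coordinates.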
On Mon, 25 Nov 2013 22:04:23 +0900, Sean Davis wrote:
> You'll want to use "transcript_start" and "transcript_end" rather than
> "start_position" and "end_position".

Thank you so much Sean! I spent the entire afternoon looking in all the wrong places.

Cheers,
-- Dave