Question: RefSeq coordinates from biomaRt
Dave Tang (Australia/Perth/UWA) wrote, 4.8 years ago:
Hello,

I've been using biomaRt to fetch genomic coordinates of RefSeqs (perhaps in an incorrect manner?). I found that the RefSeq coordinates don't match the coordinates provided at the UCSC Genome Browser (NM_033453 at chr20:3190006-3204516):

    library("biomaRt")
    ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
    getBM(attributes=c('refseq_mrna','chromosome_name','start_position','end_position','strand'),
          filters = 'refseq_mrna', values = 'NM_033453', mart = ensembl)

      refseq_mrna chromosome_name start_position end_position strand
    1   NM_033453              20        3189514      3204516      1

The coordinates seem to match this Ensembl transcript (ENST00000483354) instead:

    getBM(attributes=c('ensembl_transcript_id','chromosome_name','start_position','end_position','strand'),
          filters = 'ensembl_transcript_id', values = 'ENST00000483354', mart = ensembl)

      ensembl_transcript_id chromosome_name start_position end_position strand
    1       ENST00000483354              20        3189514      3204516      1

Here's another RefSeq model, NM_181493, which should be mapped to chr20:3190134-3204516:

    getBM(attributes=c('refseq_mrna','chromosome_name','start_position','end_position','strand'),
          filters = 'refseq_mrna', values = 'NM_181493', mart = ensembl)

      refseq_mrna chromosome_name start_position end_position strand
    1   NM_181493              20        3189514      3204516      1

So it seems the RefSeq IDs are mapped to the longest Ensembl transcript model that covers the RefSeq model. I searched around the web and looked at the different available marts, but nothing obvious popped out. How should I go about obtaining RefSeq coordinates using biomaRt? Or is biomaRt Ensembl-centric?
    sessionInfo()
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
    [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
    [5] LC_TIME=English_Australia.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] biomaRt_2.16.0

    loaded via a namespace (and not attached):
    [1] RCurl_1.95-4.1 tools_3.0.2    XML_3.98-1.1

Cheers,

--
Dave
modified 4.8 years ago by Sean Davis • written 4.8 years ago by Dave Tang
Sean Davis (United States) wrote, 4.8 years ago:
On Mon, Nov 25, 2013 at 3:47 AM, Dave Tang <davetingpongtang@gmail.com> wrote:

> [...]
> How should I go about obtaining RefSeq coordinates using biomaRt?

Hi, Dave.

There may be multiple issues going on here, so you'll have to do some digging yourself when discrepancies like this arise. Working through your first example, keep in mind that neither Ensembl nor UCSC is the actual curator of the RefSeq transcripts. NCBI is the source of that annotation. So, if you go to NCBI Gene, search for NM_033453, and then play a bit with the Genomic Sequence Viewer, you'll note that the Gene (protein NP_258412.1) is mapped with the coordinates given at UCSC, while the mRNA is mapped with the coordinates given by Ensembl. Add to this complication that UCSC does its own mapping of the transcripts (even RefSeq), and you could even have a "unique" set of coordinates given by UCSC (i.e., not the same as NCBI or Ensembl).

Sean
written 4.8 years ago by Sean Davis
On Mon, 25 Nov 2013 19:31:22 +0900, Sean Davis <sdavis2 at mail.nih.gov> wrote:

> [...]

Hi Sean,

Thank you for the prompt reply.

My aim is to have a set of transcript annotations as opposed to gene annotations; I don't really mind whether they are RefSeqs or Ensembl transcript models. But I keep running into the same problem: the coordinates returned for either Ensembl or RefSeq transcripts are the coordinates of the Ensembl gene that encompasses all the transcripts, i.e. the full extent of the Ensembl gene.

Here's another example:

    library("biomaRt")
    ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

    # ENST00000398344 is on chr22:24,313,554-24,316,773
    getBM(attributes = c('chromosome_name', 'start_position', 'end_position', 'strand'),
          filters = 'ensembl_transcript_id',
          values = 'ENST00000398344',
          mart = ensembl)

      chromosome_name start_position end_position strand
    1              22       24313554     24322660     -1

    # ENST00000430101 is on chr22:24,315,293-24,316,648
    getBM(attributes = c('chromosome_name', 'start_position', 'end_position', 'strand'),
          filters = 'ensembl_transcript_id',
          values = 'ENST00000430101',
          mart = ensembl)

      chromosome_name start_position end_position strand
    1              22       24313554     24322660     -1

Is it possible to obtain genomic coordinates of Ensembl transcripts via biomaRt?

    sessionInfo()
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
    [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
    [5] LC_TIME=English_Australia.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] biomaRt_2.18.0

    loaded via a namespace (and not attached):
    [1] RCurl_1.95-4.1 tools_3.0.2    XML_3.98-1.1

Cheers,

--
Dave
written 4.8 years ago by Dave Tang
On Mon, Nov 25, 2013 at 7:39 AM, Dave Tang <davetingpongtang@gmail.com> wrote:

> [...]
> Is it possible to obtain genomic coordinates of Ensembl transcripts via
> biomaRt?

Hi, Dave.

You'll want to use "transcript_start" and "transcript_end" rather than "start_position" and "end_position".

Sean
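For anyone finding this thread later, here is a minimal sketch of the transcript-level query suggested above. In the Ensembl gene mart, "start_position" and "end_position" are gene-level attributes, which would explain why both transcripts in the earlier example came back with identical coordinates. The helper name transcript_coords is made up for illustration and is not part of biomaRt. The sketch assumes the biomaRt package is installed and the Ensembl BioMart server is reachable.

```r
# Transcript-level attributes. 'start_position'/'end_position' describe the
# whole gene; 'transcript_start'/'transcript_end' describe each transcript.
transcript_attrs <- c("ensembl_transcript_id", "chromosome_name",
                      "transcript_start", "transcript_end", "strand")

# Hypothetical helper (not a biomaRt function): one row per transcript ID,
# with that transcript's own start and end.
transcript_coords <- function(ids, mart) {
  biomaRt::getBM(attributes = transcript_attrs,
                 filters    = "ensembl_transcript_id",
                 values     = ids,
                 mart       = mart)
}

# The query itself needs network access, so it only runs interactively.
if (interactive()) {
  ensembl <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  print(transcript_coords(c("ENST00000398344", "ENST00000430101"), ensembl))
}
```

The coordinates returned today may be on a newer genome assembly than the one used in 2013, so they will not necessarily match the numbers quoted in this thread.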
written 4.8 years ago by Sean Davis
On Mon, 25 Nov 2013 22:04:23 +0900, Sean Davis <sdavis2 at mail.nih.gov> wrote:

> You'll want to use "transcript_start" and "transcript_end" rather than
> "start_position" and "end_position".

Thank you so much Sean! I spent the entire afternoon looking in all the wrong places.

Cheers,

--
Dave
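Applied to the RefSeq IDs from the original question, the same change might look like the sketch below. The helper name refseq_transcript_coords is hypothetical and not part of biomaRt. Note the assumption being made: the rows returned describe the Ensembl transcripts that Ensembl cross-references to each RefSeq ID, not NCBI's own alignment of the RefSeq mRNA, so they can still differ from what NCBI or UCSC report, as discussed earlier in the thread.

```r
# Report the RefSeq ID next to the Ensembl transcript it is linked to,
# using transcript-level (not gene-level) start and end.
refseq_attrs <- c("refseq_mrna", "ensembl_transcript_id", "chromosome_name",
                  "transcript_start", "transcript_end", "strand")

# Hypothetical helper (not a biomaRt function). A RefSeq ID linked to
# several Ensembl transcripts yields several rows.
refseq_transcript_coords <- function(ids, mart) {
  biomaRt::getBM(attributes = refseq_attrs,
                 filters    = "refseq_mrna",
                 values     = ids,
                 mart       = mart)
}

# Needs the biomaRt package and network access to the Ensembl BioMart server.
if (interactive()) {
  ensembl <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  print(refseq_transcript_coords(c("NM_033453", "NM_181493"), ensembl))
}
```

If coordinates exactly as curated by NCBI are needed, a RefSeq-based annotation source is probably the safer route than the Ensembl mart.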
written 4.8 years ago by Dave Tang