Mapping genomic coordinates to transcript coordinates? (revived)
2
0
Entering edit mode
@herve-pages-1542
Last seen 16 hours ago
Seattle, WA, United States
Hi Chris, Malcolm, There is the transcriptLocs2refLocs() function in Biostrings that does the reverse mapping i.e. it maps transcript coordinates to genomic coordinates. There is no doubt that the GenomicFeatures package would be a better place for this function so we should move it there. Here is how it works: Let's say we have 3 transcripts, 2 on the + strand and 1 on the - strand, with the following exons (1-based genomic coordinates): exonStarts <- list( c(2401L, 3922L, 4806L), c(2401L, 4806L), c(6291L, 5278L) ) exonEnds <- list( c(3344L, 4421L, 5200L), c(3344L, 5200L), c(6500L, 5899L) ) strand <- c("+", "+", "-") The lengths of the transcripts are: > sapply(1:3, function(i) sum(exonEnds[[i]] - exonStarts[[i]] + 1L)) [1] 1839 1339 832 Let's say we have positions on each transcript that we want to map to the genome. Those positions are stored in a list of 3 integer vectors: tlocs <- list( c(1L, 200L, 1000L, 1839L), 940:950, c(1L, 832L) ) All those positions can be converted into genomic coordinates with transcriptLocs2refLocs(): > library(Biostrings) > transcriptLocs2refLocs(tlocs, exonStarts, exonEnds, strand) [[1]] [1] 2401 2600 3977 5200 [[2]] [1] 3340 3341 3342 3343 3344 4806 4807 4808 4809 4810 4811 [[3]] [1] 6500 5278 It's vectorized and fast (implemented in C). Unfortunately we don't have a refLocs2transcriptLocs() function at the moment for going the other way around but, yes, that's something we should definitely have. When called on the previous result and with the same 'exonStarts', 'exonEnds' and 'strand' values, it should return the original 'tlocs'. There would be 2 complications for such a refLocs2transcriptLocs though: 1. If the genomic location doesn't hit the transcript. Not a big deal, NA could be used for this. 2. Sometimes (very rarely) the genomic location hits an ambiguous location on the transcript (e.g. for a small number of transcripts in UCSC knownGene track, some exons overlap). What to do then? Also those 2 functions should really be in GenomicFeatures, not in Biostrings, and their interface should be modernized to accept a GRangesList object instead of exonStarts, exonEnds and strand (the transcriptLocs2refLocs() function predates the GenomicRanges era). Here in Seattle we didn't work on this yet because of lack of time and also because there was apparently no demand for it so far. For now, I'm just going to move transcriptLocs2refLocs() to GenomicFeatures so it's more visible and it will also make it easier for someone interested to contribute. H. ----- Original Message ----- From: "Chris Fields" <cjfields@illinois.edu> To: "Malcolm Cook" <mec at="" stowers.org=""> Cc: "lawrence.michael at gene.com" <lawrence.michael at="" gene.com="">, "amackey at virginia.edu" <amackey at="" virginia.edu="">, "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> Sent: Friday, February 25, 2011 7:45:11 AM Subject: Re: [BioC] Mapping genomic coordinates to transcript coordinates? (revived) +1 (from the bioperl end). Would be nice to have a straightforward BioC-based way of doing this. Aaron's suggestion of a similar (maybe simpler?) API to Bio::Coordinate::GeneMapper is good (though as pointed out in the thread it is a bit complex). chris On Feb 25, 2011, at 9:25 AM, Cook, Malcolm wrote: > Hiya, > > I am reviving this thread > > https://stat.ethz.ch/pipermail/bioconductor/2010-November/036661.html > > and am poised to follow the lead proposed by Marc Carlson > > but before I do..... has anyone beaten me to it? > > And/or does anyone have a suggestion as to how to best contribute such an effort back .... > > ... as an addition to GenomicFeatures perhaps > > Best, > > Malcolm Cook > Stowers Institute for Medical Research - Bioinformatics > Kansas City, Missouri USA > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
Biostrings GenomicFeatures Biostrings GenomicFeatures • 1.7k views
ADD COMMENT
0
Entering edit mode
Chris Fields ▴ 90
@chris-fields-4329
Last seen 2.1 years ago
United States
On Mar 3, 2011, at 1:58 AM, Pages, Herve wrote: > Hi Chris, Malcolm, > > There is the transcriptLocs2refLocs() function in Biostrings that > does the reverse mapping i.e. it maps transcript coordinates to > genomic coordinates. There is no doubt that the GenomicFeatures > package would be a better place for this function so we should move > it there. ... <apologies, excised="" the="" very="" useful="" code="" for="" easier="" reading=""> ... > It's vectorized and fast (implemented in C). Nice! > Unfortunately we don't have a refLocs2transcriptLocs() function at > the moment for going the other way around but, yes, that's something > we should definitely have. When called on the previous result and with > the same 'exonStarts', 'exonEnds' and 'strand' values, it should return > the original 'tlocs'. > > There would be 2 complications for such a refLocs2transcriptLocs though: > > 1. If the genomic location doesn't hit the transcript. Not a big deal, > NA could be used for this. Agreed. > 2. Sometimes (very rarely) the genomic location hits an ambiguous > location on the transcript (e.g. for a small number of transcripts > in UCSC knownGene track, some exons overlap). What to do then? I suppose we would need examples of this, at least for documenting in the future. As for what to do, not sure myself beyond issuing a warning about the ambiguity and returning the first or last value (or have an argument indicating what to do under such circumstances, such as allow a user-defined function pick the value, etc). > Also those 2 functions should really be in GenomicFeatures, not > in Biostrings, and their interface should be modernized to accept > a GRangesList object instead of exonStarts, exonEnds and strand > (the transcriptLocs2refLocs() function predates the GenomicRanges > era). I agree. I wouldn't think to find this in Biostrings. > Here in Seattle we didn't work on this yet because of lack of time > and also because there was apparently no demand for it so far. For > now, I'm just going to move transcriptLocs2refLocs() to GenomicFeatures > so it's more visible and it will also make it easier for someone > interested to contribute. > > H. Seems to be the way things are implemented in any OS project, someone has an itch to scratch. chris
ADD COMMENT
0
Entering edit mode
Hi Chris, Malcom, On 03/03/2011 06:45 AM, Chris Fields wrote: [...] >> For now, I'm just going to move transcriptLocs2refLocs() to GenomicFeatures >> so it's more visible and it will also make it easier for someone >> interested to contribute. This is done in GenomicFeatures 1.3.14 (devel). Cheers, H.
ADD REPLY
0
Entering edit mode
> > Hi Chris, Malcolm, ... All good news, Herve, thanks for the history and status and plans.... > > Here in Seattle we didn't work on this yet because of lack > of time and > > also because there was apparently no demand for it so far. For now, > > I'm just going to move transcriptLocs2refLocs() to > GenomicFeatures so > > it's more visible and it will also make it easier for someone > > interested to contribute. > > > > H. > > Seems to be the way things are implemented in any OS project, > someone has an itch to scratch. ... I'm likely to scratch it myself soon.... though not in C
ADD REPLY
0
Entering edit mode
@michael-lawrence-3846
Last seen 2.4 years ago
United States
On Wed, Mar 2, 2011 at 11:58 PM, Pages, Herve <hpages@fhcrc.org> wrote: > Hi Chris, Malcolm, > > There is the transcriptLocs2refLocs() function in Biostrings that > does the reverse mapping i.e. it maps transcript coordinates to > genomic coordinates. There is no doubt that the GenomicFeatures > package would be a better place for this function so we should move > it there. > > Here is how it works: > > Let's say we have 3 transcripts, 2 on the + strand and 1 on the - > strand, with the following exons (1-based genomic coordinates): > > exonStarts <- list( > c(2401L, 3922L, 4806L), > c(2401L, 4806L), > c(6291L, 5278L) > ) > > exonEnds <- list( > c(3344L, 4421L, 5200L), > c(3344L, 5200L), > c(6500L, 5899L) > ) > > strand <- c("+", "+", "-") > > The lengths of the transcripts are: > > > sapply(1:3, function(i) sum(exonEnds[[i]] - exonStarts[[i]] + 1L)) > [1] 1839 1339 832 > > Let's say we have positions on each transcript that we want to > map to the genome. Those positions are stored in a list of 3 > integer vectors: > > tlocs <- list( > c(1L, 200L, 1000L, 1839L), > 940:950, > c(1L, 832L) > ) > > All those positions can be converted into genomic coordinates with > transcriptLocs2refLocs(): > > This is a cool function but can't it be generalized for any compound set of ranges, like any GRangesList? For example, one could want to translate read-relative positions in a spliced read alignment to global positions, or the other way around. Especially the other way around. Returning an IntegerList would be one way to handle multiple possibilities. > > library(Biostrings) > > transcriptLocs2refLocs(tlocs, exonStarts, exonEnds, strand) > [[1]] > [1] 2401 2600 3977 5200 > > [[2]] > [1] 3340 3341 3342 3343 3344 4806 4807 4808 4809 4810 4811 > > [[3]] > [1] 6500 5278 > > It's vectorized and fast (implemented in C). > > Unfortunately we don't have a refLocs2transcriptLocs() function at > the moment for going the other way around but, yes, that's something > we should definitely have. When called on the previous result and with > the same 'exonStarts', 'exonEnds' and 'strand' values, it should return > the original 'tlocs'. > > There would be 2 complications for such a refLocs2transcriptLocs though: > > 1. If the genomic location doesn't hit the transcript. Not a big deal, > NA could be used for this. > > 2. Sometimes (very rarely) the genomic location hits an ambiguous > location on the transcript (e.g. for a small number of transcripts > in UCSC knownGene track, some exons overlap). What to do then? > > Also those 2 functions should really be in GenomicFeatures, not > in Biostrings, and their interface should be modernized to accept > a GRangesList object instead of exonStarts, exonEnds and strand > (the transcriptLocs2refLocs() function predates the GenomicRanges > era). > > Here in Seattle we didn't work on this yet because of lack of time > and also because there was apparently no demand for it so far. For > now, I'm just going to move transcriptLocs2refLocs() to GenomicFeatures > so it's more visible and it will also make it easier for someone > interested to contribute. > > H. > > > ----- Original Message ----- > From: "Chris Fields" <cjfields@illinois.edu> > To: "Malcolm Cook" <mec@stowers.org> > Cc: "lawrence.michael@gene.com" <lawrence.michael@gene.com>, " > amackey@virginia.edu" <amackey@virginia.edu>, "bioconductor@r-project.org" > <bioconductor@r-project.org> > Sent: Friday, February 25, 2011 7:45:11 AM > Subject: Re: [BioC] Mapping genomic coordinates to transcript coordinates? > (revived) > > +1 (from the bioperl end). Would be nice to have a straightforward > BioC-based way of doing this. Aaron's suggestion of a similar (maybe > simpler?) API to Bio::Coordinate::GeneMapper is good (though as pointed out > in the thread it is a bit complex). > > chris > > On Feb 25, 2011, at 9:25 AM, Cook, Malcolm wrote: > > > Hiya, > > > > I am reviving this thread > > > > > https://stat.ethz.ch/pipermail/bioconductor/2010-November/036661.html > > > > and am poised to follow the lead proposed by Marc Carlson > > > > but before I do..... has anyone beaten me to it? > > > > And/or does anyone have a suggestion as to how to best contribute such an > effort back .... > > > > ... as an addition to GenomicFeatures perhaps > > > > Best, > > > > Malcolm Cook > > Stowers Institute for Medical Research - Bioinformatics > > Kansas City, Missouri USA > > > > > > > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
Hi Michael, The naming of the function and its arguments makes it sound restricted to transcripts and exons, but, currently, the function can use any set of compound ranges. For example, if those ranges are in a GRangesList object 'grl', then use 'start(grl)' for the exons starts and 'end(grl)' for the exon ends. One restriction is that only one strand value can be assigned for any given group of ranges (i.e. any top-level element in the GRangesList). Assuming 'grl' satisfies this (and I guess that would be the case for spliced reads), then you can extract the value to pass to the strand argument with something like sapply(grl, function(gr) strand(gr)[1L]) (there are more efficient ways to do this) As I said previously transcriptLocs2refLocs() is low-level and would need to be improved or wrapped into some higher-level tool (generic function?) that would work on GRangesList, GappedAlignments, and maybe other high-level representations of sets of compound ranges. Cheers, H. ----- Original Message ----- From: "Michael Lawrence" <lawrence.michael@gene.com> To: "Herve Pages" <hpages at="" fhcrc.org=""> Cc: "Chris Fields" <cjfields at="" illinois.edu="">, bioconductor at r-project.org Sent: Wednesday, March 16, 2011 4:16:56 PM Subject: Re: [BioC] Mapping genomic coordinates to transcript coordinates? (revived) On Wed, Mar 2, 2011 at 11:58 PM, Pages, Herve < hpages at fhcrc.org > wrote: Hi Chris, Malcolm, There is the transcriptLocs2refLocs() function in Biostrings that does the reverse mapping i.e. it maps transcript coordinates to genomic coordinates. There is no doubt that the GenomicFeatures package would be a better place for this function so we should move it there. Here is how it works: Let's say we have 3 transcripts, 2 on the + strand and 1 on the - strand, with the following exons (1-based genomic coordinates): exonStarts <- list( c(2401L, 3922L, 4806L), c(2401L, 4806L), c(6291L, 5278L) ) exonEnds <- list( c(3344L, 4421L, 5200L), c(3344L, 5200L), c(6500L, 5899L) ) strand <- c("+", "+", "-") The lengths of the transcripts are: > sapply(1:3, function(i) sum(exonEnds[[i]] - exonStarts[[i]] + 1L)) [1] 1839 1339 832 Let's say we have positions on each transcript that we want to map to the genome. Those positions are stored in a list of 3 integer vectors: tlocs <- list( c(1L, 200L, 1000L, 1839L), 940:950, c(1L, 832L) ) All those positions can be converted into genomic coordinates with transcriptLocs2refLocs(): This is a cool function but can't it be generalized for any compound set of ranges, like any GRangesList? For example, one could want to translate read-relative positions in a spliced read alignment to global positions, or the other way around. Especially the other way around. Returning an IntegerList would be one way to handle multiple possibilities. > library(Biostrings) > transcriptLocs2refLocs(tlocs, exonStarts, exonEnds, strand) [[1]] [1] 2401 2600 3977 5200 [[2]] [1] 3340 3341 3342 3343 3344 4806 4807 4808 4809 4810 4811 [[3]] [1] 6500 5278 It's vectorized and fast (implemented in C). Unfortunately we don't have a refLocs2transcriptLocs() function at the moment for going the other way around but, yes, that's something we should definitely have. When called on the previous result and with the same 'exonStarts', 'exonEnds' and 'strand' values, it should return the original 'tlocs'. There would be 2 complications for such a refLocs2transcriptLocs though: 1. If the genomic location doesn't hit the transcript. Not a big deal, NA could be used for this. 2. Sometimes (very rarely) the genomic location hits an ambiguous location on the transcript (e.g. for a small number of transcripts in UCSC knownGene track, some exons overlap). What to do then? Also those 2 functions should really be in GenomicFeatures, not in Biostrings, and their interface should be modernized to accept a GRangesList object instead of exonStarts, exonEnds and strand (the transcriptLocs2refLocs() function predates the GenomicRanges era). Here in Seattle we didn't work on this yet because of lack of time and also because there was apparently no demand for it so far. For now, I'm just going to move transcriptLocs2refLocs() to GenomicFeatures so it's more visible and it will also make it easier for someone interested to contribute. H. ----- Original Message ----- From: "Chris Fields" < cjfields@illinois.edu > To: "Malcolm Cook" < MEC at stowers.org > Cc: " lawrence.michael at gene.com " < lawrence.michael at gene.com >, " amackey at virginia.edu " < amackey at virginia.edu >, " bioconductor at r-project.org " < bioconductor at r-project.org > Sent: Friday, February 25, 2011 7:45:11 AM Subject: Re: [BioC] Mapping genomic coordinates to transcript coordinates? (revived) +1 (from the bioperl end). Would be nice to have a straightforward BioC-based way of doing this. Aaron's suggestion of a similar (maybe simpler?) API to Bio::Coordinate::GeneMapper is good (though as pointed out in the thread it is a bit complex). chris On Feb 25, 2011, at 9:25 AM, Cook, Malcolm wrote: > Hiya, > > I am reviving this thread > > https://stat.ethz.ch/pipermail/bioconductor/2010-November/036661.html > > and am poised to follow the lead proposed by Marc Carlson > > but before I do..... has anyone beaten me to it? > > And/or does anyone have a suggestion as to how to best contribute such an effort back .... > > ... as an addition to GenomicFeatures perhaps > > Best, > > Malcolm Cook > Stowers Institute for Medical Research - Bioinformatics > Kansas City, Missouri USA > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Great. Given that I need a "refLocs2TranscriptLocs" or "globalToLocal" function right away, I'll stick it into GenomicRanges for you guys to tweak. Should be some fairly simple R code. Returning a DataFrame with two columns: subject index (into the grl) and mapped local coordinate. Michael On Wed, Mar 16, 2011 at 6:13 PM, Pages, Herve <hpages@fhcrc.org> wrote: > Hi Michael, > > The naming of the function and its arguments makes it sound restricted to > transcripts and exons, but, currently, the function can use any set of > compound ranges. For example, if those ranges are in a GRangesList > object 'grl', then use 'start(grl)' for the exons starts and 'end(grl)' > for the exon ends. One restriction is that only one strand value can > be assigned for any given group of ranges (i.e. any top-level element in > the GRangesList). Assuming 'grl' satisfies this (and I guess that would > be the case for spliced reads), then you can extract the value to pass > to the strand argument with something like > > sapply(grl, function(gr) strand(gr)[1L]) > > (there are more efficient ways to do this) > > As I said previously transcriptLocs2refLocs() is low-level and would need > to be improved or wrapped into some higher-level tool (generic function?) > that would work on GRangesList, GappedAlignments, and maybe other > high-level > representations of sets of compound ranges. > > Cheers, > H. > > > ----- Original Message ----- > From: "Michael Lawrence" <lawrence.michael@gene.com> > To: "Herve Pages" <hpages@fhcrc.org> > Cc: "Chris Fields" <cjfields@illinois.edu>, bioconductor@r-project.org > Sent: Wednesday, March 16, 2011 4:16:56 PM > Subject: Re: [BioC] Mapping genomic coordinates to transcript coordinates? > (revived) > > > > > On Wed, Mar 2, 2011 at 11:58 PM, Pages, Herve < hpages@fhcrc.org > wrote: > > > Hi Chris, Malcolm, > > There is the transcriptLocs2refLocs() function in Biostrings that > does the reverse mapping i.e. it maps transcript coordinates to > genomic coordinates. There is no doubt that the GenomicFeatures > package would be a better place for this function so we should move > it there. > > Here is how it works: > > Let's say we have 3 transcripts, 2 on the + strand and 1 on the - > strand, with the following exons (1-based genomic coordinates): > > exonStarts <- list( > c(2401L, 3922L, 4806L), > c(2401L, 4806L), > c(6291L, 5278L) > ) > > exonEnds <- list( > c(3344L, 4421L, 5200L), > c(3344L, 5200L), > c(6500L, 5899L) > ) > > strand <- c("+", "+", "-") > > The lengths of the transcripts are: > > > sapply(1:3, function(i) sum(exonEnds[[i]] - exonStarts[[i]] + 1L)) > [1] 1839 1339 832 > > Let's say we have positions on each transcript that we want to > map to the genome. Those positions are stored in a list of 3 > integer vectors: > > tlocs <- list( > c(1L, 200L, 1000L, 1839L), > 940:950, > c(1L, 832L) > ) > > All those positions can be converted into genomic coordinates with > transcriptLocs2refLocs(): > > > > > This is a cool function but can't it be generalized for any compound set of > ranges, like any GRangesList? For example, one could want to translate > read-relative positions in a spliced read alignment to global positions, or > the other way around. Especially the other way around. Returning an > IntegerList would be one way to handle multiple possibilities. > > > > > > library(Biostrings) > > transcriptLocs2refLocs(tlocs, exonStarts, exonEnds, strand) > [[1]] > [1] 2401 2600 3977 5200 > > [[2]] > [1] 3340 3341 3342 3343 3344 4806 4807 4808 4809 4810 4811 > > [[3]] > [1] 6500 5278 > > It's vectorized and fast (implemented in C). > > Unfortunately we don't have a refLocs2transcriptLocs() function at > the moment for going the other way around but, yes, that's something > we should definitely have. When called on the previous result and with > the same 'exonStarts', 'exonEnds' and 'strand' values, it should return > the original 'tlocs'. > > There would be 2 complications for such a refLocs2transcriptLocs though: > > 1. If the genomic location doesn't hit the transcript. Not a big deal, > NA could be used for this. > > 2. Sometimes (very rarely) the genomic location hits an ambiguous > location on the transcript (e.g. for a small number of transcripts > in UCSC knownGene track, some exons overlap). What to do then? > > Also those 2 functions should really be in GenomicFeatures, not > in Biostrings, and their interface should be modernized to accept > a GRangesList object instead of exonStarts, exonEnds and strand > (the transcriptLocs2refLocs() function predates the GenomicRanges > era). > > Here in Seattle we didn't work on this yet because of lack of time > and also because there was apparently no demand for it so far. For > now, I'm just going to move transcriptLocs2refLocs() to GenomicFeatures > so it's more visible and it will also make it easier for someone > interested to contribute. > > H. > > > > > > ----- Original Message ----- > From: "Chris Fields" < cjfields@illinois.edu > > To: "Malcolm Cook" < MEC@stowers.org > > Cc: " lawrence.michael@gene.com " < lawrence.michael@gene.com >, " > amackey@virginia.edu " < amackey@virginia.edu >, " > bioconductor@r-project.org " < bioconductor@r-project.org > > Sent: Friday, February 25, 2011 7:45:11 AM > Subject: Re: [BioC] Mapping genomic coordinates to transcript coordinates? > (revived) > > +1 (from the bioperl end). Would be nice to have a straightforward > BioC-based way of doing this. Aaron's suggestion of a similar (maybe > simpler?) API to Bio::Coordinate::GeneMapper is good (though as pointed out > in the thread it is a bit complex). > > chris > > On Feb 25, 2011, at 9:25 AM, Cook, Malcolm wrote: > > > Hiya, > > > > I am reviving this thread > > > > https://stat.ethz.ch/pipermail/bioconductor/2010-November/036661.html > > > > and am poised to follow the lead proposed by Marc Carlson > > > > but before I do..... has anyone beaten me to it? > > > > And/or does anyone have a suggestion as to how to best contribute such an > effort back .... > > > > ... as an addition to GenomicFeatures perhaps > > > > Best, > > > > Malcolm Cook > > Stowers Institute for Medical Research - Bioinformatics > > Kansas City, Missouri USA > > > > > > > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
OK. Can you please put this in GenomicFeatures instead of GenomicRanges? This is where transcriptLocs2refLocs() is right now. Thanks! H. ----- Original Message ----- From: "Michael Lawrence" <lawrence.michael@gene.com> To: "Herve Pages" <hpages at="" fhcrc.org=""> Cc: "Michael Lawrence" <lawrence.michael at="" gene.com="">, "Chris Fields" <cjfields at="" illinois.edu="">, bioconductor at r-project.org Sent: Wednesday, March 16, 2011 8:32:48 PM Subject: Re: [BioC] Mapping genomic coordinates to transcript coordinates? (revived) Great. Given that I need a "refLocs2TranscriptLocs" or "globalToLocal" function right away, I'll stick it into GenomicRanges for you guys to tweak. Should be some fairly simple R code. Returning a DataFrame with two columns: subject index (into the grl) and mapped local coordinate. Michael On Wed, Mar 16, 2011 at 6:13 PM, Pages, Herve < hpages at fhcrc.org > wrote: Hi Michael, The naming of the function and its arguments makes it sound restricted to transcripts and exons, but, currently, the function can use any set of compound ranges. For example, if those ranges are in a GRangesList object 'grl', then use 'start(grl)' for the exons starts and 'end(grl)' for the exon ends. One restriction is that only one strand value can be assigned for any given group of ranges (i.e. any top-level element in the GRangesList). Assuming 'grl' satisfies this (and I guess that would be the case for spliced reads), then you can extract the value to pass to the strand argument with something like sapply(grl, function(gr) strand(gr)[1L]) (there are more efficient ways to do this) As I said previously transcriptLocs2refLocs() is low-level and would need to be improved or wrapped into some higher-level tool (generic function?) that would work on GRangesList, GappedAlignments, and maybe other high-level representations of sets of compound ranges. Cheers, H. ----- Original Message ----- From: "Michael Lawrence" < lawrence.michael@gene.com > To: "Herve Pages" < hpages at fhcrc.org > Cc: "Chris Fields" < cjfields at illinois.edu >, bioconductor at r-project.org Sent: Wednesday, March 16, 2011 4:16:56 PM Subject: Re: [BioC] Mapping genomic coordinates to transcript coordinates? (revived) On Wed, Mar 2, 2011 at 11:58 PM, Pages, Herve < hpages at fhcrc.org > wrote: Hi Chris, Malcolm, There is the transcriptLocs2refLocs() function in Biostrings that does the reverse mapping i.e. it maps transcript coordinates to genomic coordinates. There is no doubt that the GenomicFeatures package would be a better place for this function so we should move it there. Here is how it works: Let's say we have 3 transcripts, 2 on the + strand and 1 on the - strand, with the following exons (1-based genomic coordinates): exonStarts <- list( c(2401L, 3922L, 4806L), c(2401L, 4806L), c(6291L, 5278L) ) exonEnds <- list( c(3344L, 4421L, 5200L), c(3344L, 5200L), c(6500L, 5899L) ) strand <- c("+", "+", "-") The lengths of the transcripts are: > sapply(1:3, function(i) sum(exonEnds[[i]] - exonStarts[[i]] + 1L)) [1] 1839 1339 832 Let's say we have positions on each transcript that we want to map to the genome. Those positions are stored in a list of 3 integer vectors: tlocs <- list( c(1L, 200L, 1000L, 1839L), 940:950, c(1L, 832L) ) All those positions can be converted into genomic coordinates with transcriptLocs2refLocs(): This is a cool function but can't it be generalized for any compound set of ranges, like any GRangesList? For example, one could want to translate read-relative positions in a spliced read alignment to global positions, or the other way around. Especially the other way around. Returning an IntegerList would be one way to handle multiple possibilities. > library(Biostrings) > transcriptLocs2refLocs(tlocs, exonStarts, exonEnds, strand) [[1]] [1] 2401 2600 3977 5200 [[2]] [1] 3340 3341 3342 3343 3344 4806 4807 4808 4809 4810 4811 [[3]] [1] 6500 5278 It's vectorized and fast (implemented in C). Unfortunately we don't have a refLocs2transcriptLocs() function at the moment for going the other way around but, yes, that's something we should definitely have. When called on the previous result and with the same 'exonStarts', 'exonEnds' and 'strand' values, it should return the original 'tlocs'. There would be 2 complications for such a refLocs2transcriptLocs though: 1. If the genomic location doesn't hit the transcript. Not a big deal, NA could be used for this. 2. Sometimes (very rarely) the genomic location hits an ambiguous location on the transcript (e.g. for a small number of transcripts in UCSC knownGene track, some exons overlap). What to do then? Also those 2 functions should really be in GenomicFeatures, not in Biostrings, and their interface should be modernized to accept a GRangesList object instead of exonStarts, exonEnds and strand (the transcriptLocs2refLocs() function predates the GenomicRanges era). Here in Seattle we didn't work on this yet because of lack of time and also because there was apparently no demand for it so far. For now, I'm just going to move transcriptLocs2refLocs() to GenomicFeatures so it's more visible and it will also make it easier for someone interested to contribute. H. ----- Original Message ----- From: "Chris Fields" < cjfields@illinois.edu > To: "Malcolm Cook" < MEC at stowers.org > Cc: " lawrence.michael at gene.com " < lawrence.michael at gene.com >, " amackey at virginia.edu " < amackey at virginia.edu >, " bioconductor at r-project.org " < bioconductor at r-project.org > Sent: Friday, February 25, 2011 7:45:11 AM Subject: Re: [BioC] Mapping genomic coordinates to transcript coordinates? (revived) +1 (from the bioperl end). Would be nice to have a straightforward BioC-based way of doing this. Aaron's suggestion of a similar (maybe simpler?) API to Bio::Coordinate::GeneMapper is good (though as pointed out in the thread it is a bit complex). chris On Feb 25, 2011, at 9:25 AM, Cook, Malcolm wrote: > Hiya, > > I am reviving this thread > > https://stat.ethz.ch/pipermail/bioconductor/2010-November/036661.html > > and am poised to follow the lead proposed by Marc Carlson > > but before I do..... has anyone beaten me to it? > > And/or does anyone have a suggestion as to how to best contribute such an effort back .... > > ... as an addition to GenomicFeatures perhaps > > Best, > > Malcolm Cook > Stowers Institute for Medical Research - Bioinformatics > Kansas City, Missouri USA > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY

Login before adding your answer.

Traffic: 1070 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6