A story about IRanges::GappedRanges ... how does it end?
1
0
Entering edit mode
@steve-lianoglou-2771
Last seen 14 months ago
United States
Hello, It being Sunday and all, here's a little story that I've got to tell (apologies in advance for indulging in some navel gazing): Once upon a time (yesterday) I found myself implementing a GappedRanges class[1] because I wanted to "talk about" and manipulate distances/etc. between two points that are aligned to "genomic" space, but really come from "transcriptome" space (so, I wanted to keep track of introns and things and subtract them away in results from calls to width() and its ilk). I thought I was being really clever and was liking how it was coming together. I was testing in all sorts of manner, and when I tried to have a column of a DataFrame hold my GappedRanges objects, I got some funny errors -- something about some "cnirl" slot. After some head scratching, I went digging through the IRanges source code -- lo and behold I found a GappedRanges class .. and it has a `cnirl` slot! I paused for a moment to recognize that whoever saw the need for this class and thought to implement it must surely be a prodigy in the truest sense of the word ... When my day dream had passed, I got to wondering how it was that I haven't stumbled across this class until now. I realized that no one really talks about GappedRanges on the mailing list(s) and there's no real mention of it in the IRangesOverview vignette. An `svn log` on the file shows no activity on it since December 2009, so I was wondering if this class was on its way out? Maybe someone thought it was a good idea to write it (greetings, kindred spirit) and has since come to the conclusion that maybe it's unnecessary (in which case, I'd appreciate some wisdom about that to save me some time :-) I was planning on implementing a findOverlaps method against GappedRanges that act "the normal way" but exclude any overlaps that fall within the gapped portion of the ranges ... but, I wanted to see where GappedRanges stands now before I proceed (and maybe just doing a findOverlaps against a GRangesList is close to the same thing). Perhaps I'd just be duplicating someone else's work, who's already found a better way to deal with the use cases this class (I think) would work well for. Anyway ... I was just curious about what's going on with the GappedRanges class. I find it's a natural fit for some things, even though I would grant you that you could probably twist a GRangesList into doing something similar. Thanks, -steve [1] My (partial) GappedRanges implementation: http://github.com/lianos/GenomicFeaturesX/blob/master/R/GappedRanges.R -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
Cancer IRanges Cancer IRanges • 616 views
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 2.4 years ago
United States
On Sun, Oct 24, 2010 at 11:02 AM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hello, > > It being Sunday and all, here's a little story that I've got to tell > (apologies in advance for indulging in some navel gazing): > > Once upon a time (yesterday) I found myself implementing a > GappedRanges class[1] because I wanted to "talk about" and manipulate > distances/etc. between two points that are aligned to "genomic" space, > but really come from "transcriptome" space (so, I wanted to keep track > of introns and things and subtract them away in results from calls to > width() and its ilk). > > I thought I was being really clever and was liking how it was coming > together. I was testing in all sorts of manner, and when I tried to > have a column of a DataFrame hold my GappedRanges objects, I got some > funny errors -- something about some "cnirl" slot. > > After some head scratching, I went digging through the IRanges source > code -- lo and behold I found a GappedRanges class .. and it has a > `cnirl` slot! I paused for a moment to recognize that whoever saw the > need for this class and thought to implement it must surely be a > prodigy in the truest sense of the word ... > > When my day dream had passed, I got to wondering how it was that I > haven't stumbled across this class until now. I realized that no one > really talks about GappedRanges on the mailing list(s) and there's no > real mention of it in the IRangesOverview vignette. An `svn log` on > the file shows no activity on it since December 2009, so I was > wondering if this class was on its way out? Maybe someone thought it > was a good idea to write it (greetings, kindred spirit) and has since > come to the conclusion that maybe it's unnecessary (in which case, I'd > appreciate some wisdom about that to save me some time :-) > > I was planning on implementing a findOverlaps method against > GappedRanges that act "the normal way" but exclude any overlaps that > fall within the gapped portion of the ranges ... but, I wanted to see > where GappedRanges stands now before I proceed (and maybe just doing a > findOverlaps against a GRangesList is close to the same thing). > Perhaps I'd just be duplicating someone else's work, who's already > found a better way to deal with the use cases this class (I think) > would work well for. > > Anyway ... I was just curious about what's going on with the > GappedRanges class. I find it's a natural fit for some things, even > though I would grant you that you could probably twist a GRangesList > into doing something similar. > > It's still supported. I'm guessing it hasn't changed due to lack of use. Therefore, no feature or bug requests. Could you please elaborate on why GappedRanges is more convenient than GRangesList? The assertion of normality in GappedRanges might be troublesome (e.g., could not throw overlapping exons of a gene into the same element). GappedRanges also wouldn't handle multi-chromosome features (even some genes are annotated on multiple chromosomes). I guess GappedRanges::start() will return a vector of the starts of GappedRanges, instead of an IntegerList, but one just needs to call min() on that list to get around that. For representing transcripts, we've been well served by GRangesList. If you have spliced read alignments, consider GappedAlignments for those. Michael Thanks, > -steve > > [1] My (partial) GappedRanges implementation: > http://github.com/lianos/GenomicFeaturesX/blob/master/R/GappedRanges.R > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact<http: cbio.mskc="" c.org="" %7elianos="" contact=""> > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
Howdy, Thanks for taking the time to read through my musings ... this email is also a bit long, but I'm just trying to be descriptive enough to get my point across. On Mon, Oct 25, 2010 at 10:45 AM, Michael Lawrence <lawrence.michael at="" gene.com=""> wrote: <snip> > It's still supported. I'm guessing it hasn't changed due to lack of use. > Therefore, no feature or bug requests. Could you please elaborate on why > GappedRanges is more convenient than GRangesList? Honestly, it just *feels* right :-) I want to represent two distances in transcriptome space using their coordinates in genomic space. Some of the data we are generating is high throughput SAGE sequencing data, so I'm developing a package to work with it (soup to nuts)[1] I wanted to use the GappedRanges for the SAGEseq protocol. Without getting into too many "boring" details of the protocol, I have two positions on a transcript (indicating where restriction sites are). The upstream position is defined by the location of the tag that was sequenced (the first 4 bp of the tag is the restriction site + 17 bp of transcript sequence). The downstream position is defined by the where the position of the next 3' restriction site is. The downstream 3' restriction site on the transcript isn't always the most proximal 3' restriction site in "genome space" since this site might occur in an intron -- I want to skip these sites and only report sites in exons. These two restriction sites define my "fence posts". Does that make sense? The data I have would be a GRanges object, with a GappedRanges object as one of the columns of the GRanges elementMetadata slot. A completely contrived example would look like so: R> library(GenomicRanges) R> gr <- GRanges(seqnames='chr1', IRanges(c(1, 100), width=21), strand='+') R> .gaps <- as(IRangesList(IRanges(c(1, 50, 80), width=c(30, 10, 15)), IRanges(c(100, 150, 180), width=c(30, 10, 15))), 'GappedRanges') R> values(gr) <- DataFrame(fence.posts=.gaps) GRanges with 2 ranges and 1 elementMetadata value seqnames ranges strand | fence.posts <rle> <iranges> <rle> | <gappedranges> [1] chr1 [ 1, 21] + | [ 1, 94] [2] chr1 [100, 120] + | [100, 194] Positions 94 and 194 represent the next possible position of my restriction site while 'honoring' exon/transcript structure. Having my data this way, I can also quickly look at the exon structure present in the transcript that my "real" tag (the one I sequenced) landed in. With some trivial-to-define functions over GappedRanges objects: http://github.com/lianos/GenomicFeaturesX/blob/master/R/GappedRanges- IRanges.R I can then find out interesting things about my fence.posts. Like: (i) the distance between them in transcriptome space (by 'minding the gaps'): R> gwidth(values(gr)$fence.posts, mind.the.gap=TRUE) [1] 55 55 or genomic space: R> gwidth(values(gr)$fence.posts, mind.the.gap=FALSE) [1] 94 95 (ii) I can also use the info embedded in my gapped ranges to inquire about the exon structure present between my fence posts: R> ranges(values(gr)$fence.posts, mind.the.gap=TRUE) CompressedNormalIRangesList of length 2 [[1]] NormalIRanges of length 3 start end width [1] 1 30 30 [2] 50 59 10 [3] 80 94 15 [[2]] NormalIRanges of length 3 start end width [1] 100 129 30 [2] 150 159 10 [3] 180 194 15 (ii) Or I can just ignore the embedded gap/exon structure: IRanges of length 2 start end width [1] 1 94 94 [2] 100 194 95 So -- like I said, it just seems like a more natural fit to me than a GRangesList here -- and it's a bit less heavy weight than a GRangesList as I don't need strand/seqinfo information for my gaps since they are already present in the GRanges themselves. Hopefully that makes some sense? -steve [1] The package can also work with other tag-sequencing data. We're also developing a "new" high throughput 3'RACE-like protocol to sequence the very end of transcripts, and there is DeepCAGE (already in the wild) data that I'd also like my package to be able to deal with. (in case anybody was interested :-) -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY
0
Entering edit mode
On Wed, Oct 27, 2010 at 7:09 AM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Howdy, > > Thanks for taking the time to read through my musings ... this email > is also a bit long, but I'm just trying to be descriptive enough to > get my point across. > > On Mon, Oct 25, 2010 at 10:45 AM, Michael Lawrence > <lawrence.michael@gene.com> wrote: > <snip> > > It's still supported. I'm guessing it hasn't changed due to lack of use. > > Therefore, no feature or bug requests. Could you please elaborate on why > > GappedRanges is more convenient than GRangesList? > > Honestly, it just *feels* right :-) > > I want to represent two distances in transcriptome space using their > coordinates in genomic space. > > Some of the data we are generating is high throughput SAGE sequencing > data, so I'm developing a package to work with it (soup to nuts)[1] > > I wanted to use the GappedRanges for the SAGEseq protocol. Without > getting into too many "boring" details of the protocol, I have two > positions on a transcript (indicating where restriction sites are). > The upstream position is defined by the location of the tag that was > sequenced (the first 4 bp of the tag is the restriction site + 17 bp > of transcript sequence). The downstream position is defined by the > where the position of the next 3' restriction site is. The downstream > 3' restriction site on the transcript isn't always the most proximal > 3' restriction site in "genome space" since this site might occur in > an intron -- I want to skip these sites and only report sites in > exons. > > These two restriction sites define my "fence posts". Does that make sense? > > The data I have would be a GRanges object, with a GappedRanges object > as one of the columns of the GRanges elementMetadata slot. A > completely contrived example would look like so: > > R> library(GenomicRanges) > R> gr <- GRanges(seqnames='chr1', IRanges(c(1, 100), width=21), strand='+') > R> .gaps <- as(IRangesList(IRanges(c(1, 50, 80), width=c(30, 10, 15)), > IRanges(c(100, 150, 180), width=c(30, 10, 15))), > 'GappedRanges') > R> values(gr) <- DataFrame(fence.posts=.gaps) > > GRanges with 2 ranges and 1 elementMetadata value > seqnames ranges strand | fence.posts > <rle> <iranges> <rle> | <gappedranges> > [1] chr1 [ 1, 21] + | [ 1, 94] > [2] chr1 [100, 120] + | [100, 194] > > Positions 94 and 194 represent the next possible position of my > restriction site while 'honoring' exon/transcript structure. Having my > data this way, I can also quickly look at the exon structure present > in the transcript that my "real" tag (the one I sequenced) landed in. > > With some trivial-to-define functions over GappedRanges objects: > > http://github.com/lianos/GenomicFeaturesX/blob/master/R /GappedRanges-IRanges.R > > I can then find out interesting things about my fence.posts. Like: > > (i) the distance between them in transcriptome space (by 'minding the > gaps'): > R> gwidth(values(gr)$fence.posts, mind.the.gap=TRUE) > [1] 55 55 > > or genomic space: > R> gwidth(values(gr)$fence.posts, mind.the.gap=FALSE) > [1] 94 95 > > (ii) I can also use the info embedded in my gapped ranges to inquire > about the exon structure present between my fence posts: > > R> ranges(values(gr)$fence.posts, mind.the.gap=TRUE) > CompressedNormalIRangesList of length 2 > [[1]] > NormalIRanges of length 3 > start end width > [1] 1 30 30 > [2] 50 59 10 > [3] 80 94 15 > > [[2]] > NormalIRanges of length 3 > start end width > [1] 100 129 30 > [2] 150 159 10 > [3] 180 194 15 > > (ii) Or I can just ignore the embedded gap/exon structure: > IRanges of length 2 > start end width > [1] 1 94 94 > [2] 100 194 95 > > So -- like I said, it just seems like a more natural fit to me than a > GRangesList here -- and it's a bit less heavy weight than a > GRangesList as I don't need strand/seqinfo information for my gaps > since they are already present in the GRanges themselves. > > Hopefully that makes some sense? > > Yes, thanks. I see where you are going, and I'll sort of take your word for it about what's useful here. If you have any contributions for GappedRanges or any other part of IRanges, we'd be happy to incorporate them. Michael > -steve > > > [1] The package can also work with other tag-sequencing data. We're > also developing a "new" high throughput 3'RACE-like protocol to > sequence the very end of transcripts, and there is DeepCAGE (already > in the wild) data that I'd also like my package to be able to deal > with. > > (in case anybody was interested :-) > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact<http: cbio.mskc="" c.org="" %7elianos="" contact=""> > [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 359 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6