Semantics of GenomicRanges gaps()
1
0
Entering edit mode
@herve-pages-1542
Last seen 15 hours ago
Seattle, WA, United States
Hi Alex, On 05/31/2013 01:46 AM, Alex Gutteridge wrote: > I just have a quick question/comment about the behaviour of the gaps > function in the GenomicRanges package, particularly with how it handles > * strand ranges. The current behaviour is as below: > >> range = GRanges(seqnames="1",IRanges(start=300,end=700), strand="*", >> seqinfo=Seqinfo("1",seqlengths=1000)) >> range > GRanges with 1 range and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] 1 [300, 700] * > --- > seqlengths: > 1 > 1000 >> gaps(range) > GRanges with 4 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] 1 [ 1, 1000] + > [2] 1 [ 1, 1000] - > [3] 1 [ 1, 299] * > [4] 1 [701, 1000] * > --- > seqlengths: > 1 > 1000 > > My interpretation of "*" as a strand identifier is that it means the > range covers both + and - strands and so intuitively I was expecting the > 'gaps' to only cover the 3rd and 4th ranges returned above (not the > full-length + and - strand ranges). It's even worse than that. If there is no range at all on a chromosome, gaps(gr) will return 3 ranges covering the full chromosome: > gr <- GRanges("chr1", IRanges(start=300,end=700), seqlengths=c(chr1=1000,chr2=1000)) > gr GRanges with 1 range and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [300, 700] * --- seqlengths: chr1 chr2 1000 1000 > gaps(gr) GRanges with 7 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [ 1, 1000] + [2] chr1 [ 1, 1000] - [3] chr1 [ 1, 299] * [4] chr1 [701, 1000] * [5] chr2 [ 1, 1000] + [6] chr2 [ 1, 1000] - [7] chr2 [ 1, 1000] * --- seqlengths: chr1 chr2 1000 1000 > The semantics here implies to me > that the * strand is being thought of as a kind of imaginary third > strand and hence doesn't overlap with the + or - strands. This is > contrary to the semantics of findOverlaps which does appear to consider > a * strand range to overlap with + or - strand ranges: > >> findOverlaps(range,gaps(range)) > Hits of length 2 > queryLength: 1 > subjectLength: 4 > queryHits subjectHits > <integer> <integer> > 1 1 1 > 2 1 2 > > To summarise I guess I was expecting findOverlaps(range,gaps(range)) to > never return any overlaps under any circumstance (that I can think of!). Let's call this property 2. This sounds indeed like a good property that it would be nice to have. But an even more important property is property 1: gaps() must be its own reverse operation i.e. 'gaps(gaps(gr))' must always return 'gr'. The current behavior of gaps() guarantees property 1. I'm not against changing gaps() behavior to also have property 2, as long as property 1 is preserved. Suggestions are welcome. Thanks, H. > Do people agree the behaviour of gaps() is not quite intuitive in this > case? > >> sessionInfo() > R version 2.15.2 (2012-10-26) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicRanges_1.10.7 IRanges_1.16.6 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] parallel_2.15.2 stats4_2.15.2 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
Cancer GenomicRanges Cancer GenomicRanges • 872 views
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 2.4 years ago
United States
Not sure how confusing this would be, but a common case is (like Alex's input) a set of ranges where the strand is uniform, i.e., all ranges have strand 'x'. In that case, gaps,GenomicRanges could behave just like gaps,Ranges and return only ranges on strand 'x'. That would satisfy both properties. A natural extension would be that if all ranges are '+' or '-' (none of them are '*'), then do not add the '*' range spanning the whole sequence. Also satisfies both properties. I only know of one use-case where all three strand types are mixed: rna-seq, where the strand has been inferred only from the spliced alignments. Not if gaps() would even be useful in that case, but if it were, I think the current gaps behavior would make sense. Michael On Sun, Jun 2, 2013 at 11:36 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Alex, > > > On 05/31/2013 01:46 AM, Alex Gutteridge wrote: > >> I just have a quick question/comment about the behaviour of the gaps >> function in the GenomicRanges package, particularly with how it handles >> * strand ranges. The current behaviour is as below: >> >> range = GRanges(seqnames="1",IRanges(**start=300,end=700), strand="*", >>> seqinfo=Seqinfo("1",**seqlengths=1000)) >>> range >>> >> GRanges with 1 range and 0 metadata columns: >> seqnames ranges strand >> <rle> <iranges> <rle> >> [1] 1 [300, 700] * >> --- >> seqlengths: >> 1 >> 1000 >> >>> gaps(range) >>> >> GRanges with 4 ranges and 0 metadata columns: >> seqnames ranges strand >> <rle> <iranges> <rle> >> [1] 1 [ 1, 1000] + >> [2] 1 [ 1, 1000] - >> [3] 1 [ 1, 299] * >> [4] 1 [701, 1000] * >> --- >> seqlengths: >> 1 >> 1000 >> >> My interpretation of "*" as a strand identifier is that it means the >> range covers both + and - strands and so intuitively I was expecting the >> 'gaps' to only cover the 3rd and 4th ranges returned above (not the >> full-length + and - strand ranges). >> > > It's even worse than that. If there is no range at all on a chromosome, > gaps(gr) will return 3 ranges covering the full chromosome: > > > gr <- GRanges("chr1", > IRanges(start=300,end=700), > seqlengths=c(chr1=1000,chr2=**1000)) > > > gr > > GRanges with 1 range and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [300, 700] * > --- > seqlengths: > chr1 chr2 > 1000 1000 > > > gaps(gr) > > GRanges with 7 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [ 1, 1000] + > [2] chr1 [ 1, 1000] - > [3] chr1 [ 1, 299] * > [4] chr1 [701, 1000] * > [5] chr2 [ 1, 1000] + > [6] chr2 [ 1, 1000] - > [7] chr2 [ 1, 1000] * > --- > seqlengths: > chr1 chr2 > 1000 1000 > > > The semantics here implies to me >> that the * strand is being thought of as a kind of imaginary third >> strand and hence doesn't overlap with the + or - strands. This is >> contrary to the semantics of findOverlaps which does appear to consider >> a * strand range to overlap with + or - strand ranges: >> >> findOverlaps(range,gaps(range)**) >>> >> Hits of length 2 >> queryLength: 1 >> subjectLength: 4 >> queryHits subjectHits >> <integer> <integer> >> 1 1 1 >> 2 1 2 >> >> To summarise I guess I was expecting findOverlaps(range,gaps(range)**) to >> never return any overlaps under any circumstance (that I can think of!). >> > > Let's call this property 2. This sounds indeed like a good property > that it would be nice to have. But an even more important property is > property 1: gaps() must be its own reverse operation i.e. > 'gaps(gaps(gr))' must always return 'gr'. > > The current behavior of gaps() guarantees property 1. I'm not against > changing gaps() behavior to also have property 2, as long as property > 1 is preserved. Suggestions are welcome. > > Thanks, > H. > > > Do people agree the behaviour of gaps() is not quite intuitive in this >> case? >> >> sessionInfo() >>> >> R version 2.15.2 (2012-10-26) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] GenomicRanges_1.10.7 IRanges_1.16.6 BiocGenerics_0.4.0 >> >> loaded via a namespace (and not attached): >> [1] parallel_2.15.2 stats4_2.15.2 >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]
ADD COMMENT

Login before adding your answer.

Traffic: 780 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6