Question: Bug: readGff3 isRightOpen should be FALSE
4.4 years ago by
sjackman0
sjackman0 wrote:

GFF files are definitely closed intervals. The default value of isRightOpen should be FALSE, not TRUE.

Here's one reference: https://www.broadinstitute.org/igv/GFF

> Start and end positions are identified using a one-based index. The end position is included. For example, setting start-end to 1-2 describes two bases, the first and second in the sequence.
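
As a minimal sketch of this convention (base R only; the coordinates are made up for illustration), the number of bases covered by a closed, one-based interval is end - start + 1, and reading the same numbers as right-open loses the last base:

```r
# Hypothetical GFF-style coordinates: one-based, both ends included.
start <- c(1, 6, 100)
end   <- c(2, 9, 100)

# A closed interval [start, end] covers end - start + 1 bases.
n_bases <- end - start + 1
n_bases
# [1] 2 4 1

# Reading the same numbers as right-open [start, end) drops the last base.
n_bases_right_open <- end - start
n_bases_right_open
# [1] 1 3 0
```

The first record is the 1-2 example from the quote: two bases when closed, one base when read as right-open.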

I found this previous post on a similar topic: gff files: how to tell if right-open interval convention used?

Thanks,

Shaun

genomeintervals • 800 views
modified 4.4 years ago by Nicolas Delhomme320 • written 4.4 years ago by sjackman0

Sounds like this discussion has already happened. You could use rtracklayer::import() instead, it does the right thing.
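
A minimal sketch of that suggestion, assuming rtracklayer is installed; the one-feature GFF3 file is invented for illustration and written to a temporary path:

```r
# Write a tiny, made-up GFF3 file with a single feature spanning 1-2.
gff_lines <- c(
  "##gff-version 3",
  paste("chr1", "example", "exon", 1, 2, ".", "+", ".", "ID=exon1", sep = "\t")
)
gff_path <- tempfile(fileext = ".gff3")
writeLines(gff_lines, gff_path)

# Only run the import when rtracklayer is available.
if (requireNamespace("rtracklayer", quietly = TRUE)) {
  gr <- rtracklayer::import(gff_path, format = "gff3")
  # import() returns a GRanges object, whose ranges are closed on both ends,
  # so the 1-2 feature should be reported with a width of 2.
  print(BiocGenerics::width(gr))
}
```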

Hej Dan and Shaun!

Given the imprecision of "The Sequence Ontology Project" in their gff3 format specification, genomeIntervals also does the "right" thing. Julien's original answer to Hervé's comment more than 4 years ago (what Shaun refers to) is still valid: it's a different interpretation of the format specification and is surely not less right ;-). Nowhere is it stated whether intervals should be open or closed in GFF3 files. That the left end is closed is apparent from the format description, but there are 2 sentences that may indicate that the right end could be open, i.e. those that refer to zero-length features.

Anyway, no need to repeat that discussion, and since most gff3 files now "de facto" use closed intervals, it's about time for readGff3 to rally to the community consensus. I'll modify the default value of isRightOpen in the development version of the genomeIntervals package and document the change. I'll also have the method issue a warning to make sure no users are caught out by this.

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme, PhD
The Street Lab
Department of Plant Physiology
Umeå Plant Science Center
Tel: +46 90 786 5478
Email: nicolas.delhomme@umu.se
SLU - Umeå universitet
Umeå S-901 87 Sweden
---------------------------------------------------------------

> On 20 Jun 2015, at 00:14, Dan Tenenbaum [bioc] <noreply@bioconductor.org> wrote:
>
> Sounds like this discussion has already happened. You could use rtracklayer::import() instead, it does the right thing.

Thanks, Nico!

4.4 years ago by
Sweden
Nicolas Delhomme320 wrote:
Hej Shaun!

In genomeIntervals 1.25.1 (in devel), you'll find that I've:

o Changed readGff3 to use closed intervals by default. Implemented two sub-functions: one reads a gff3 as base-pair features only (no zero-length intervals, i.e. right-closed intervals), the other allows for zero-length intervals (i.e. right-open intervals, when start equals end)
o Deprecated the seq_name accessors in favour of the BiocGenerics seqnames
o Added a width accessor, similar to the IRanges functionality but taking into account the fact that genomeIntervals can deal with right-open or closed intervals
o Added coercion to GRangesList and RangedData

HTH,

Cheers,

Nico

> On 19 Jun 2015, at 00:28, sjackman [bioc] <noreply@bioconductor.org> wrote:
>
> GFF files are definitely closed intervals. The default value of isRightOpen should be FALSE, not TRUE.
>
> Here's one reference: https://www.broadinstitute.org/igv/GFF
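
The two readings described in the first bullet can be sketched in base R as follows. This is only an illustration of the idea: gff3_width and its allow_zero_length argument are hypothetical names and are not part of genomeIntervals.

```r
# Hypothetical helper, not part of genomeIntervals: number of bases covered by
# a GFF3 record under the two readings discussed in this thread.
gff3_width <- function(start, end, allow_zero_length = FALSE) {
  # Default reading: every record is a closed interval [start, end].
  width <- end - start + 1
  if (allow_zero_length) {
    # Alternative reading: start == end marks a zero-length feature,
    # i.e. the right-open interval [start, end).
    width[start == end] <- 0
  }
  width
}

gff3_width(c(1, 5), c(2, 5))
# [1] 2 1
gff3_width(c(1, 5), c(2, 5), allow_zero_length = TRUE)
# [1] 2 0
```

Only records with start equal to end differ between the two readings, which is why the choice matters mainly for files that encode insertion sites or other inter-base features.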

Hi Nico,

[6, 10), [6, 9], (5, 10), and (5, 9] are four different notations for the same thing, i.e. they all represent the set of integers {6,7,8,9}. So it is weird/confusing that their widths are not the same:

library(genomeIntervals)
gi <- GenomeIntervals(start=c(6,6,5,5), end=c(10,9,10,9),
                      chromosome=rep("chr1", each=4),
                      leftOpen=c(FALSE, FALSE, TRUE, TRUE),
                      rightOpen=c(TRUE, FALSE, TRUE, FALSE))
gi
# Object of class Genome_intervals
# 4 base intervals and 0 inter-base intervals(*):
# chr1 [6, 10)
# chr1 [6, 9]
# chr1 (5, 10)
# chr1 (5, 9]
#
# annotation:
#  seq_name inter_base
# 1     chr1      FALSE
# 2     chr1      FALSE
# 3     chr1      FALSE
# 4     chr1      FALSE

width(gi)
# [1] 5 4 6 5


From a consistency/interoperability point of view, it would be good to use the same semantics as in the IRanges package, where the width of an interval (or range) is the number of integer values that belong to the interval (see ?IRanges::width), i.e. 4 in each of the four cases above.
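
A minimal base R sketch of that semantics, applied to the same four intervals. The helper name integer_width is hypothetical; it is neither the genomeIntervals nor the IRanges implementation.

```r
# Hypothetical helper: count the integers contained in each interval,
# whatever combination of open/closed ends is used to write it.
integer_width <- function(start, end, left_open, right_open) {
  # An open end excludes its endpoint, so shift it inward by one
  # to get the equivalent closed interval.
  closed_start <- start + as.integer(left_open)
  closed_end   <- end - as.integer(right_open)
  # Empty intervals get width 0 rather than a negative number.
  pmax(closed_end - closed_start + 1, 0)
}

integer_width(
  start      = c(6, 6, 5, 5),
  end        = c(10, 9, 10, 9),
  left_open  = c(FALSE, FALSE, TRUE, TRUE),
  right_open = c(TRUE, FALSE, TRUE, FALSE)
)
# [1] 4 4 4 4
```

All four notations reduce to the closed interval [6, 9], so each has width 4.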

Cheers,

H.

Thanks Hervé, I'll fix that. I had in mind gff3 cases when I created the width function so I overlooked the left-open possibility.

Nico

> On 23 Jun 2015, at 22:02, Hervé Pagès [bioc] <noreply@bioconductor.org> wrote:
>
> From a consistency/interoperability point of view, it would be good to use the same semantic as in the IRanges package where the width of an interval (or range) is the number of integer values that belong to the interval (see ?IRanges::width), i.e. 4 in that case.

Thanks again Hervé for spotting this. I've fixed it in version 1.25.2.

Cheers,

Nico

> On 24 Jun 2015, at 09:27, Nicolas Delhomme [bioc] <noreply@bioconductor.org> wrote:
>
> Thanks Hervé, I'll fix that. I had in mind gff3 cases when I created the width function so I overlooked the left-open possibility.

Great. Thanks for making that change.

H.