Search
Question: IRanges (GRanges): Inconsistent behaviour when overlapping ranges of width 0
1
14 months ago by
balwierz40
United Kingdom
balwierz40 wrote:

I will describe the problem using GRanges, but it is solely within IRanges. Ranges of width are apparently allowed. You can obtain them for instance calling promoters(TxDb, upstream=0, downstream=0)

Specifically, these ranges cannot be found by distanceToNearest _if_ they are within the subject, but are easily found if they lie outside.
They also do not appear in subsetByOverlaps and findOverlaps unless you change minoverlap from 1L (default) to 0L.

IMO minoverlap should be set to 0L by default in functions which use it, and distanceToNearest should be able to handle these ranges properly. Example code below.

Out of curiosity: how are IRanges implemented internally? Are indices following Jim Kent "UCSC" convention, or is it internally handled as as in user functions?

   a = GRanges(c("a:2-2", "a:4-3", "a:12-13", "a:15-14"))
b = GRanges("a:1-10")
a
GRanges object with 4 ranges and 0 metadata columns:
seqnames    ranges strand
<Rle> <IRanges>  <Rle>
[1]        a  [ 2,  2]      *
[2]        a  [ 4,  3]      *
[3]        a  [12, 13]      *
[4]        a  [15, 14]      *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

width(a)
[1] 1 0 2 0

b
GRanges object with 1 range and 0 metadata columns:
seqnames    ranges strand
<Rle> <IRanges>  <Rle>
[1]        a   [1, 10]      *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

distanceToNearest(a,b)
Hits object with 3 hits and 1 metadata column:
queryHits subjectHits |  distance
<integer>   <integer> | <integer>
[1]         1           1 |         0
[2]         3           1 |         1
[3]         4           1 |         4
-------
queryLength: 4 / subjectLength: 1

subsetByOverlaps(a,b)
GRanges object with 1 range and 0 metadata columns:
seqnames    ranges strand
<Rle> <IRanges>  <Rle>
[1]        a    [2, 2]      *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

subsetByOverlaps(a,b, minoverlap=0)
GRanges object with 2 ranges and 0 metadata columns:
seqnames    ranges strand
<Rle> <IRanges>  <Rle>
[1]        a    [2, 2]      *
[2]        a    [4, 3]      *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
modified 13 months ago by Hervé Pagès ♦♦ 13k • written 14 months ago by balwierz40

Hi,

Thanks for the report. The current behavior of distanceToNearest() w.r.t. zero-width ranges seems wrong. I'll look into this.

FWIW an IRanges object uses 2 parallel slots to represent the ranges: one slot for the 1-based starts, and one slot for the widths.

H.

0
13 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi,

distanceToNearest() (and nearest(), which was also affected) now calls findOverlaps() internally with minoverlap=1 instead of minoverlap=0. With this change, the following code now behaves as expected:

library(GenomicRanges)
a <- GRanges(c("a:2-2", "a:4-3", "a:12-13", "a:15-14"))
b <- GRanges("a:1-10")
distanceToNearest(a,b)
# Hits object with 4 hits and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         0
#   [2]         2           1 |         0
#   [3]         3           1 |         1
#   [4]         4           1 |         4
#   -------
#   queryLength: 4 / subjectLength: 1

This change also fixes the behavior of nearest() in the following situation:

query <- GRanges("chr1", IRanges(5, 10))
subject <- GRanges("chr1", IRanges(1, 4:5))
nearest(query, subject, select="all")
# Hits object with 2 hits and 0 metadata columns:
#       queryHits subjectHits
#       <integer>   <integer>
#   [1]         1           1
#   [2]         1           2
#   -------
#  queryLength: 1 / subjectLength: 2


The 2 ranges in the subject now are both considered to be nearest to the 5-10 range (instead of only the 2nd one).

The fixes are in IRanges 2.10.3 (release) and 2.11.14 (devel), and in GenomicRanges 1.28.5 (release) and 1.29.13 (devel).

For subsetByOverlaps(), my inclination would be to just change the minoverlap default from 1 to 0 in the interface of the generic and all its methods. With this change zero-width ranges in the 1st argument (i.e. in the object to subset) won't be lost anymore. However it's important to realize that even if the 1st argument doesn't contain zero-width ranges, this change will sometimes alter the results obtained so far because ranges that are adjacent to (but not strictly overlapping with) the 2nd argument will now be kept. For example:

  subsetByOverlaps(IRanges(9:12, 15), IRanges(1, 10))


will return IRanges(9:11, 15) instead of IRanges(9:10, 15).

I'll post again here when it's done.

H.

So I made that change to subsetByOverlaps() but in devel only (in IRanges 2.11.15). Note that it's now consistent with the snpsByOverlaps() function from the BSgenome package which has been using minoverlap=0 by default since the beginning. Other *ByOverlaps() functions (e.g. cdsByOverlaps() in the GenomicFeatures package) would need to be modified to do this too but I'll keep this for another time.

Cheers,

H.

Hi,

After more discussion on this (see here https://github.com/Bioconductor/IRanges/issues/1), I've decided to change the maxgap and minoverlap defaults to -1 and 0, respectively, for all the members of the findOverlaps() family. With this change everybody behaves consistently with respect to zero-width ranges and adjacent ranges. In particular, by default, nobody ignores zero-width ranges or counts adjacent ranges as overlapping. Use minoverlap=1 to ignore zero-width ranges or maxgap=0 to count adjacent ranges as overlapping.

The change is in IRanges 2.11.16, GenomicRanges 1.29.14, GenomicAlignments 1.13.6, and SummarizedExperiment 1.7.8. There are more Bioconductor packages that define methods for members of the findOverlaps() family and they will probably need to be modified too.

See ?findOverlaps (in IRanges >= 2.11.16) for more information about maxgap and the meaning of maxgap=-1.

Cheers,

H.