Search
Question: IRanges (GRanges): Inconsistent behaviour when overlapping ranges of width 0
1
gravatar for balwierz
5 weeks ago by
balwierz40
United Kingdom
balwierz40 wrote:

I will describe the problem using GRanges, but it is solely within IRanges. Ranges of width are apparently allowed. You can obtain them for instance calling `promoters(TxDb, upstream=0, downstream=0)`  

 

Specifically, these ranges cannot be found by `distanceToNearest` _if_ they are within the subject, but are easily found if they lie outside.
They also do not appear in `subsetByOverlaps` and `findOverlaps` unless you change minoverlap from 1L (default) to 0L.

IMO minoverlap should be set to 0L by default in functions which use it, and distanceToNearest should be able to handle these ranges properly. Example code below.

Out of curiosity: how are IRanges implemented internally? Are indices following Jim Kent "UCSC" convention, or is it internally handled as as in user functions?

   a = GRanges(c("a:2-2", "a:4-3", "a:12-13", "a:15-14"))
   b = GRanges("a:1-10")
   a
GRanges object with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        a  [ 2,  2]      *
  [2]        a  [ 4,  3]      *
  [3]        a  [12, 13]      *
  [4]        a  [15, 14]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
 
   width(a)
[1] 1 0 2 0

   b
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        a   [1, 10]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
 
   distanceToNearest(a,b)
Hits object with 3 hits and 1 metadata column:
      queryHits subjectHits |  distance
      <integer>   <integer> | <integer>
  [1]         1           1 |         0
  [2]         3           1 |         1
  [3]         4           1 |         4
  -------
  queryLength: 4 / subjectLength: 1
 
   subsetByOverlaps(a,b)
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        a    [2, 2]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
 
   subsetByOverlaps(a,b, minoverlap=0)
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        a    [2, 2]      *
  [2]        a    [4, 3]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD COMMENTlink modified 18 days ago by Hervé Pagès ♦♦ 12k • written 5 weeks ago by balwierz40

Hi,

Thanks for the report. The current behavior of distanceToNearest() w.r.t. zero-width ranges seems wrong. I'll look into this.

FWIW an IRanges object uses 2 parallel slots to represent the ranges: one slot for the 1-based starts, and one slot for the widths.

H.

ADD REPLYlink written 26 days ago by Hervé Pagès ♦♦ 12k
0
gravatar for Hervé Pagès
18 days ago by
Hervé Pagès ♦♦ 12k
United States
Hervé Pagès ♦♦ 12k wrote:

Hi,

distanceToNearest() (and nearest(), which was also affected) now calls findOverlaps() internally with minoverlap=1 instead of minoverlap=0. With this change, the following code now behaves as expected:

library(GenomicRanges)
a <- GRanges(c("a:2-2", "a:4-3", "a:12-13", "a:15-14"))
b <- GRanges("a:1-10")
distanceToNearest(a,b)
# Hits object with 4 hits and 1 metadata column:
#       queryHits subjectHits |  distance
#       <integer>   <integer> | <integer>
#   [1]         1           1 |         0
#   [2]         2           1 |         0
#   [3]         3           1 |         1
#   [4]         4           1 |         4
#   -------
#   queryLength: 4 / subjectLength: 1

This change also fixes the behavior of nearest() in the following situation:

query <- GRanges("chr1", IRanges(5, 10))
subject <- GRanges("chr1", IRanges(1, 4:5))
nearest(query, subject, select="all")
# Hits object with 2 hits and 0 metadata columns:
#       queryHits subjectHits
#       <integer>   <integer>
#   [1]         1           1
#   [2]         1           2
#   -------
#  queryLength: 1 / subjectLength: 2

The 2 ranges in the subject now are both considered to be nearest to the 5-10 range (instead of only the 2nd one).

The fixes are in IRanges 2.10.3 (release) and 2.11.14 (devel), and in GenomicRanges 1.28.5 (release) and 1.29.13 (devel).

For subsetByOverlaps(), my inclination would be to just change the minoverlap default from 1 to 0 in the interface of the generic and all its methods. With this change zero-width ranges in the 1st argument (i.e. in the object to subset) won't be lost anymore. However it's important to realize that even if the 1st argument doesn't contain zero-width ranges, this change will sometimes alter the results obtained so far because ranges that are adjacent to (but not strictly overlapping with) the 2nd argument will now be kept. For example:

  subsetByOverlaps(IRanges(9:12, 15), IRanges(1, 10))

will return IRanges(9:11, 15) instead of IRanges(9:10, 15).

I'll post again here when it's done.

Thanks for your feedback,

H.

ADD COMMENTlink modified 17 days ago • written 18 days ago by Hervé Pagès ♦♦ 12k

So I made that change to subsetByOverlaps() but in devel only (in IRanges 2.11.15). Note that it's now consistent with the snpsByOverlaps() function from the BSgenome package which has been using minoverlap=0 by default since the beginning. Other *ByOverlaps() functions (e.g. cdsByOverlaps() in the GenomicFeatures package) would need to be modified to do this too but I'll keep this for another time.

Cheers,

H.

ADD REPLYlink written 17 days ago by Hervé Pagès ♦♦ 12k

Hi,

After more discussion on this (see here https://github.com/Bioconductor/IRanges/issues/1), I've decided to change the maxgap and minoverlap defaults to -1 and 0, respectively, for all the members of the findOverlaps() family. With this change everybody behaves consistently with respect to zero-width ranges and adjacent ranges. In particular, by default, nobody ignores zero-width ranges or counts adjacent ranges as overlapping. Use minoverlap=1 to ignore zero-width ranges or maxgap=0 to count adjacent ranges as overlapping.

The change is in IRanges 2.11.16, GenomicRanges 1.29.14, GenomicAlignments 1.13.6, and SummarizedExperiment 1.7.8. There are more Bioconductor packages that define methods for members of the findOverlaps() family and they will probably need to be modified too.

See ?findOverlaps (in IRanges >= 2.11.16) for more information about maxgap and the meaning of maxgap=-1.

Cheers,

H.

ADD REPLYlink written 10 days ago by Hervé Pagès ♦♦ 12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 102 users visited in the last hour