Question

Understanding findoverlaps function in GenomicRanges

1

Entering edit mode

Alexandre ▴ 10

@095e334e

Last seen 2.7 years ago

Hong Kong

I would like to understand better the meaning of the indexes for both Granges and Grangeslist objects. I will post here what I think it's going on in this function and please do correct me if I'm wrong.

Let's assume two grange objects:

> g
GRanges object with 4 ranges and 2 metadata columns:
    seqnames    ranges strand |     score        GC
       <Rle> <IRanges>  <Rle> | <integer> <numeric>
  a     chr1   101-113      - |         1  1.000000
  b     chr2   102-112      + |         2  0.888889
  c     chr2   103-113      + |         3  0.777778
  d     chr2   150-200      * |         4  0.666667
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

and

> g2
GRanges object with 2 ranges and 2 metadata columns:
    seqnames    ranges strand |     score        GC
       <Rle> <IRanges>  <Rle> | <integer> <numeric>
  a     chr1   111-120      - |         1  1.000000
  b     chr2   102-112      + |         2  0.888889
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

If I use findoverlaps like that:

> findOverlaps(g,g2)
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         2           2
  [3]         3           2
  -------
  queryLength: 4 / subjectLength: 2

Here query is g and subject is g2 and the indexes(numbers) of queryhits from this Hits object which are 1 2 3 represent the index in the g object. Is that correct ? So 1 2 3 indexes in this case will be the first 3 rows of g, which are:

> g
GRanges object with 4 ranges and 2 metadata columns:
    seqnames    ranges strand |     score        GC
       <Rle> <IRanges>  <Rle> | <integer> <numeric>
  a     chr1   101-113      - |         1  1.000000
  b     chr2   102-112      + |         2  0.888889
  c     chr2   103-113      + |         3  0.777778

For the subjectHits these are the indexes of g2 which the query is hitting to, is that correct ? So the first index of g hits the first index of g2, the second hits the second and the third hits the second again.

If those statements are correct I think I got the idea of those indices in both queryHits and subjectHits columns.

Now things start to get confusing when using a grange object and grangelist, for example this grangelist:

> grl
GRangesList object of length 2:
$txA
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |     score        GC
         <Rle> <IRanges>  <Rle> | <integer> <numeric>
  [1]     chr2   103-106      + |         5      0.45
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

$txB
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |     score        GC
         <Rle> <IRanges>  <Rle> | <integer> <numeric>
  [1]     chr1   107-109      + |         3       0.3
  [2]     chr1   113-115      - |         4       0.5
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

If I use findoverlaps like that:

> findOverlaps(g,grl)
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           2
  [2]         2           1
  [3]         3           1
  -------
  queryLength: 4 / subjectLength: 2

I think the queryHits are following the same logic as before, but subjectHits is representing not the row indices of the different objects in the list but rather the list indices, is that correct ? So in this case 2 1 1 means that the first query is hitting a genomic region somewhere in the 2 index of this GrangeList object. This 2 is equal to $txB in this case.

I'm thinking about the possibilities of building a matrix of hits and missing hits using the genomicRanges package but I'm not sure if this function is appropriate for my use case:

I would like to have something like that when using findoverlaps(g,grl):

      queryHits     txa             txb
      <integer>   <logical>   <logical>
  [1]         1           0               1
  [2]         2           1               0
  [3]         3           1               0
  [4]         4           0               0

So here in this hypothetical case we have a logical vectors for both txa and txb objects , there queryhits correspond to the indexes of g and even if there are no hits the index should be included, for example index 4 doesn't hit any regions in txa and txb and so we have 0 0. For the first index it hits txb but not txa.

Any help is much appreciated!!! thanks

GenomicRanges genomicra • 7.1k views

ADD COMMENT • link updated 3.3 years ago by James W. MacDonald 68k • written 3.3 years ago by Alexandre ▴ 10

0

Entering edit mode

So if you have a GRangesList that is 100,000 items long you want to return a DataFrame that has dimension length(gr) x length(grl)? That will be a very sparse DataFrame, and seems pretty inefficient? Perhaps it might be better if you just said what your use case is, and maybe then people can provide input.

ADD REPLY • link 3.3 years ago James W. MacDonald 68k

0

Entering edit mode

Hi James,

I would deeply appreciate if people could provide input on my understanding of what findoverlaps does when using 2 granges objects and when using granges and grangelist, what really those indices represent in both scenarios. My GRangesList will have at most 5 items long for my use case, so I'm not expecting to produce a large sparse matrix from such computation.

ADD REPLY • link 3.3 years ago Alexandre ▴ 10

0

Entering edit mode

To see if your understanding is correct you can just read the help page, which says

Details:

     When the query and the subject are GRanges or GRangesList objects,
     'findOverlaps' uses the triplet (sequence name, range, strand) to
     determine which features (see paragraph below for the definition
     of feature) from the 'query' overlap which features in the
     'subject', where a strand value of '"*"' is treated as occurring
     on both the '"+"' and '"-"' strand.  An overlap is recorded when a
     feature in the 'query' and a feature in the 'subject' have the
     same sequence name, have a compatible pairing of strands (e.g.
     '"+"'/'"+"', '"-"'/'"-"', '"*"'/'"+"', '"*"'/'"-"', etc.), and
     satisfy the interval overlap requirements.

     In the context of 'findOverlaps', a feature is a collection of
     ranges that are treated as a single entity. For GRanges objects, a
     feature is a single range; while for GRangesList objects, a
     feature is a list element containing a set of ranges. In the
     results, the features are referred to by number, which run from 1
     to 'length(query)'/'length(subject)'.

     For 'type="equal"' with GRangesList objects, 'query[[i]]' matches
     'subject[[j]]' iff for each range in 'query[[i]]' there is an
     identical range in 'subject[[j]]', and vice versa.

Which I believe already answers your question?

ADD REPLY • link 3.3 years ago James W. MacDonald 68k