Search
Question: Transitive overlaps of GRanges object
2
3.5 years ago by
Germany
Stefanie Tauber40 wrote:

Hi,

I have a GRanges object und would like to retrieve sets of reciprocal overlaps:

library(GenomicRanges)

gr = GRanges(seqnames = rep("chr1",4),
ranges   = IRanges(start = c(1,2,3,8),
end  = c(12,6,6,10)),
strand = rep("+", 4))
> gr
GRanges object with 4 ranges and 0 metadata columns:
seqnames    ranges strand
<Rle> <IRanges>  <Rle>
[1]     chr1   [1, 12]      +
[2]     chr1   [2,  6]      +
[3]     chr1   [3,  6]      +
[4]     chr1   [8, 10]      +
-------

Range #1 overlaps with all other ranges, however I would like to get only those overlaps which hold in a transitive manner. such as: Range #1 overlaps with #2, #2 with #3 and #3 with #1.
At the moment I am filtering the output of findOverlaps (unsatisfactorily with the help of a loop..). Is there a more direct way to do this?

Stefanie

modified 3.5 years ago by Hervé Pagès ♦♦ 13k • written 3.5 years ago by Stefanie Tauber40
5
3.5 years ago by
Sean Davis21k
United States
Sean Davis21k wrote:

The problem could be thought of in graph terms.  The nodes are the regions.  The edges consist of connecting all regions that overlap.  These edges are exactly what is returned by findOverlaps().  If you take these concepts and actually construct a graph, there are a number of packages that will construct the transitive closure.  This eliminates the need to do findOverlaps() in the loop.

Edit: Vince points out that the RBGL package has a transitive closure functionality; this should be pretty performant, I suspect.

Edit: Here is some code with a largish set of regions (10,000 exons) showing performance (about 5 seconds for the entire script, including transitive closure calculation).

# Form a large set of regions
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
exons = exons(TxDb.Hsapiens.UCSC.hg19.knownGene)[1:10000]
library(graph)
# Make a graph with edges representing overlapping regions
g = graphNEL(nodes=as.character(seq_along(exons)))
x = as.data.frame(findOverlaps(exons))
# Compute the transitive closure
library(RGBL)
system.time((g1 = transitive.closure(g)))
user  system elapsed
2.666   0.426   3.110
g1
# And use edges(g1) to capture the transitive relationships.
$1 [1] "1"$2
[1] "2" "3" "4"

$3 [1] "2" "3" "4"$4
[1] "2" "3" "4"

$5 [1] "5" "6"$6
[1] "5" "6"

2

I'll note that RBGL package has a transitive.closure function that operates on various types

of graph defined in the graph package.  Transforming the range data into a graph for this purpose

doesn't seem particularly hard, but in a naive approach there would be pairwise testing.

Thanks for pointing out the graph perspective. I will consider to reformulate the problem in this context!

What a neat, elegant way :) thanks a lot!

I wish all my posts were associated with "neat" and "elegant"--thanks!

1
3.5 years ago by
Germany
Stefanie Tauber40 wrote:

Hi Herve,

I think one way to achieve what I would like to do is as follow:

ov = as.matrix(findOverlaps(gr, gr))
ranges_by_overlap = split(gr[ov[, 2]], ov[, 1])

lapply(ranges_by_overlap, function(x) Reduce(intersect,x))

By doing so, all 'correct' transitive overlaps will have a valid intersection as output from Reduce(intersect,x). All non-transitive overlaps will return an empty range.

However, this is unfortunately pretty slow...
Best,

Stefanie

0
3.5 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Stephanie,

Not sure what you're trying to achieve exactly but note that the easiest way to "cluster ranges by overlap" is by calling reduce() with with.revmap=TRUE:

library(GenomicRanges)
gr <- GRanges(seqnames="chr1",
ranges=IRanges(c( 1, 2, 3,  8, 15, 20,  9),
c(12, 6, 6, 10, 20, 25, 13)),
strand="+")
rgr <- reduce(gr, min.gapwidth=0, with.revmap=TRUE)


Note that min.gapwidth is 1 by default and that means that adjacent ranges that do not overlap would be considered to be in the same cluster. So we use min.gapwidth=0 to avoid that.

Then:

> rgr
GRanges object with 2 ranges and 1 metadata column:
seqnames    ranges strand |        revmap
<Rle> <IRanges>  <Rle> | <IntegerList>
[1]     chr1  [ 1, 13]      + |     1,2,3,...
[2]     chr1  [15, 25]      + |           5,6
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths


rgr contains 1 range per cluster and its revmap metadata column contains the indices of the original ranges that belong to each cluster:

> mcols(rgr)$revmap IntegerList of length 2 [[1]] 1 2 3 4 7 [[2]] 5 6  Hope this helps, H. ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Hervé Pagès ♦♦ 13k Hi Herve, thanks for pointing out the 'with.revmap' argument. I was not aware of that argument. In the end I would like to have the intersection of ranges for which the transitive overlap holds. I would compute the intersection of those ranges via Reduce(intersect, granges). However, I have troubles to get the ranges clustered by transitive overlaps. The problem with mcols(rgr)$revmap is that I retrieve again non-transitive overlap clusters.

Best,

Stefanie