Transitive overlaps of GRanges object
3
2
Entering edit mode
@stefanie-tauber-3978
Last seen 7 weeks ago
Germany

Hi,

I have a GRanges object und would like to retrieve sets of reciprocal overlaps:

library(GenomicRanges)

gr = GRanges(seqnames = rep("chr1",4),
ranges   = IRanges(start = c(1,2,3,8),
end  = c(12,6,6,10)),
strand = rep("+", 4))
> gr
GRanges object with 4 ranges and 0 metadata columns:
seqnames    ranges strand
<Rle> <IRanges>  <Rle>
[1]     chr1   [1, 12]      +
[2]     chr1   [2,  6]      +
[3]     chr1   [3,  6]      +
[4]     chr1   [8, 10]      +
-------

Range #1 overlaps with all other ranges, however I would like to get only those overlaps which hold in a transitive manner. such as: Range #1 overlaps with #2, #2 with #3 and #3 with #1.
At the moment I am filtering the output of findOverlaps (unsatisfactorily with the help of a loop..). Is there a more direct way to do this?

Stefanie

granges findOverlaps • 2.0k views
5
Entering edit mode
@sean-davis-490
Last seen 3 days ago
United States

The problem could be thought of in graph terms.  The nodes are the regions.  The edges consist of connecting all regions that overlap.  These edges are exactly what is returned by findOverlaps().  If you take these concepts and actually construct a graph, there are a number of packages that will construct the transitive closure.  This eliminates the need to do findOverlaps() in the loop.

Edit: Vince points out that the RBGL package has a transitive closure functionality; this should be pretty performant, I suspect.

Edit: Here is some code with a largish set of regions (10,000 exons) showing performance (about 5 seconds for the entire script, including transitive closure calculation).

# Form a large set of regions
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
exons = exons(TxDb.Hsapiens.UCSC.hg19.knownGene)[1:10000]
library(graph)
# Make a graph with edges representing overlapping regions
g = graphNEL(nodes=as.character(seq_along(exons)))
x = as.data.frame(findOverlaps(exons))
# Compute the transitive closure
library(RGBL)
system.time((g1 = transitive.closure(g)))
user  system elapsed
2.666   0.426   3.110
g1
# And use edges(g1) to capture the transitive relationships.
$1 [1] "1"$2
[1] "2" "3" "4"

$3 [1] "2" "3" "4"$4
[1] "2" "3" "4"

$5 [1] "5" "6"$6
[1] "5" "6"

2
Entering edit mode

I'll note that RBGL package has a transitive.closure function that operates on various types

of graph defined in the graph package.  Transforming the range data into a graph for this purpose

doesn't seem particularly hard, but in a naive approach there would be pairwise testing.

0
Entering edit mode

Thanks for pointing out the graph perspective. I will consider to reformulate the problem in this context!

0
Entering edit mode

What a neat, elegant way :) thanks a lot!

0
Entering edit mode

I wish all my posts were associated with "neat" and "elegant"--thanks!

1
Entering edit mode
@stefanie-tauber-3978
Last seen 7 weeks ago
Germany

Hi Herve,

I think one way to achieve what I would like to do is as follow:

ov = as.matrix(findOverlaps(gr, gr))
ranges_by_overlap = split(gr[ov[, 2]], ov[, 1])

lapply(ranges_by_overlap, function(x) Reduce(intersect,x))

By doing so, all 'correct' transitive overlaps will have a valid intersection as output from Reduce(intersect,x). All non-transitive overlaps will return an empty range.

However, this is unfortunately pretty slow...
Best,

Stefanie

0
Entering edit mode
@herve-pages-1542
Last seen 21 hours ago
Seattle, WA, United States

Hi Stephanie,

Not sure what you're trying to achieve exactly but note that the easiest way to "cluster ranges by overlap" is by calling reduce() with with.revmap=TRUE:

library(GenomicRanges)
gr <- GRanges(seqnames="chr1",
ranges=IRanges(c( 1, 2, 3,  8, 15, 20,  9),
c(12, 6, 6, 10, 20, 25, 13)),
strand="+")
rgr <- reduce(gr, min.gapwidth=0, with.revmap=TRUE)


Note that min.gapwidth is 1 by default and that means that adjacent ranges that do not overlap would be considered to be in the same cluster. So we use min.gapwidth=0 to avoid that.

Then:

> rgr
GRanges object with 2 ranges and 1 metadata column:
seqnames    ranges strand |        revmap
<Rle> <IRanges>  <Rle> | <IntegerList>
[1]     chr1  [ 1, 13]      + |     1,2,3,...
[2]     chr1  [15, 25]      + |           5,6
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths


rgr contains 1 range per cluster and its revmap metadata column contains the indices of the original ranges that belong to each cluster:

> mcols(rgr)$revmap IntegerList of length 2 [[1]] 1 2 3 4 7 [[2]] 5 6  Hope this helps, H. ADD COMMENT 0 Entering edit mode Hi Herve, thanks for pointing out the 'with.revmap' argument. I was not aware of that argument. In the end I would like to have the intersection of ranges for which the transitive overlap holds. I would compute the intersection of those ranges via Reduce(intersect, granges). However, I have troubles to get the ranges clustered by transitive overlaps. The problem with mcols(rgr)$revmap is that I retrieve again non-transitive overlap clusters.

Best,

Stefanie