Question: Merge multiple genomic intervals if overlapping more than xx percents
gravatar for samuel collombet
2.7 years ago by
samuel collombet110 wrote:


I have multiple set of genomic intervals (peaks from ChIPseq, at least 4 set of several 1000s of intervals), and I would like to merge them into one unique set of interval, where intervals overlapping are fused, if the overlap is more than 60% of the length of one of them. 

I guess I should be able to do that with GenomciRanges and the overlap function, but I do not find a way to do it (sorry I do not have a lot of experience with the package). Anyone would see how to do that?

ADD COMMENTlink modified 2.7 years ago by Hervé Pagès ♦♦ 13k • written 2.7 years ago by samuel collombet110
gravatar for Michael Lawrence
2.7 years ago by
United States
Michael Lawrence9.9k wrote:

These types of problems are most easily solved by detecting which ranges overlap, and then filtering those overlaps for those that match the criteria (60% overlap), and finally merging the intervals. You could call this completely untested function inside a Reduce() iteration:

mergeOverlapping <- function(x, y, minfrac=0.6) {
    x <- granges(x)
    y <- granges(y)
    hits <- findOverlaps(x, y)
    xhits <- x[queryHits(hits)]
    yhits <- y[subjectHits(hits)]
    frac <- width(pintersect(xhits, yhits)) / pmin(width(xhits), width(yhits))
    merge <- frac >= minfrac
    c(reduce(c(xhits[merge], yhits[merge])),
      xhits[!merge], yhits[!merge],
      x[-queryHits(hits)], y[-subjectHits(hits)])

Note that this drops all metadata columns. But it really seems like reduce() could gain some option for minimum absolute or relative overlap.


ADD COMMENTlink written 2.7 years ago by Michael Lawrence9.9k

Thanks Michael!

So the only way to merge overlap ranges from my 4 (or more) granges, is to perform all the pairwise comparison?
My problem is: if I have a range a in A that overlap a range b in B (with more than 60%), and a also overlap (>60%) a range c in C, I want then to merge then all into only 1 ranges... I guess I should compare A and B to get a new granges A', compare A' to C... and so forth. I guess this way there is no chance that I fuse a range c with a range a', when there would not have been a a or b with which I would have merged c?
(of course I can merge c with a' when c overlapping b but not a, but I want that to happen (I think))

ADD REPLYlink written 2.7 years ago by samuel collombet110

It might help if you describe your goal. What are you going to use these merged ranges for? My initial proposal was a progressive reduction, i.e., requiring overlap with the result of the previous merge. An alternative would be to require that each range have 60% overlap with every other range in its "cluster".

ADD REPLYlink written 2.7 years ago by Michael Lawrence9.9k

Basically I want to make a unique annotation of enriched regions, based on multiple ChIPseq experiment (I get enriched regions for each regions using macs for peak calling).
For now I just merge any overlapping regions from any experiments ; However, I have found that some times, I can find 2 peaks in 2 conditions that are close, overlap a little bit, but for which I know they are distinct functional entity. So  I would like to merge those that only have a relatively large overlap.

About the "that each range have 60% overlap with every other range in its "cluster"", this also give artifact, at sometimes a very small peak is called in one condition, but a much bigger one in another ; the small one can be entirely covered by the big one, but he only represented a small fraction of the second ; so they would not be merge. 

I think the progressive reduction should work fine :)

ADD REPLYlink written 2.7 years ago by samuel collombet110
gravatar for Hervé Pagès
2.7 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Samuel,

I guess this is not quite as high level as what Michael was hoping for (that is, the extra option to reduce() for controlling minimum relative overlap) but it gets us close. On the other hand it has the advantage to be more flexible by letting the user specify an arbitrary clustering (i.e. which ranges must be merged) via a Hits object:


## Extract clusters from Hits object.
extractClustersFromSelfHits <- function(hits)
    stopifnot(is(hits, "Hits"))
    stopifnot(queryLength(hits) == subjectLength(hits))
    hits <- union(hits, t(hits))
    qh <- queryHits(hits)
    sh <- subjectHits(hits)
    cid <- seq_len(queryLength(hits))  # cluster ids
    while (TRUE) {
        h <- Hits(qh, cid[sh],
                  queryLength(hits), subjectLength(hits))
        cid2 <- pmin(cid, selectHits(h, "first"))
        if (identical(cid2, cid))
        cid <- cid2
    unname(splitAsList(seq_len(queryLength(hits)), cid))

## Merge ranges that are "connected" (directly or indirectly)
## via a hit (or several hits) in 'hits'.
mergeConnectedRanges <- function(x, hits)
    stopifnot(is(x, "GenomicRanges"))
    stopifnot(is(hits, "Hits"))
    stopifnot(queryLength(hits) == subjectLength(hits))
    stopifnot(queryLength(hits) == length(x))
    clusters <- extractClustersFromSelfHits(hits)
    ans <- range(extractList(x, clusters))
    if (any(elementLengths(ans) != 1L))
        stop(wmsg("some connected ranges are not on the same ",
                  "chromosome and strand, and thus cannot be ",
    ans <- unlist(ans)
    mcols(ans)$revmap <- clusters

For your problem ("Merge multiple genomic intervals if overlapping more than xx percents"), it can be used as follow:

## Let's say the genomic intervals to merge are in GRanges object 'gr0'.
strand(gr0) <- "*"
hits <- findOverlaps(gr0)

## Subset 'hits' to keep only hits that achieve 60% overlap.
x <- gr0[queryHits(hits)]
y <- gr0[subjectHits(hits)]
relative_overlap <- width(pintersect(x, y)) / pmin(width(x), width(y))
hits <- hits[relative_overlap >= 0.6]

## Merge the ranges in 'gr0' that are connected via one or more hits in 'hits'.
gr1 <- mergeConnectedRanges(gr0, hits)

Note that the returned GRanges object has 1 element (i.e. 1 genomic range) per cluster and a metadata column revmap that maps each cluster to the corresponding elements in gr0. You can extract this metadata column with mcols(gr1)$revmap (it's an IntegerList).

Hope this helps,

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by Hervé Pagès ♦♦ 13k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 464 users visited in the last hour