Question

Merge multiple genomic intervals if overlapping more than xx percents

0

Entering edit mode

samuel collombet ▴ 140

@samuel-collombet-6574

Last seen 8.6 years ago

France

Hi,

I have multiple set of genomic intervals (peaks from ChIPseq, at least 4 set of several 1000s of intervals), and I would like to merge them into one unique set of interval, where intervals overlapping are fused, if the overlap is more than 60% of the length of one of them.

I guess I should be able to do that with GenomciRanges and the overlap function, but I do not find a way to do it (sorry I do not have a lot of experience with the package). Anyone would see how to do that?

genomicranges • 11k views

ADD COMMENT • link written 10.6 years ago by samuel collombet ▴ 140

score 3 · Answer 1 · 2015-05-26

3

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 4.0 years ago

United States

These types of problems are most easily solved by detecting which ranges overlap, and then filtering those overlaps for those that match the criteria (60% overlap), and finally merging the intervals. You could call this completely untested function inside a Reduce() iteration:

mergeOverlapping <- function(x, y, minfrac=0.6) {
    x <- granges(x)
    y <- granges(y)
    hits <- findOverlaps(x, y)
    xhits <- x[queryHits(hits)]
    yhits <- y[subjectHits(hits)]
    frac <- width(pintersect(xhits, yhits)) / pmin(width(xhits), width(yhits))
    merge <- frac >= minfrac
    c(reduce(c(xhits[merge], yhits[merge])),
      xhits[!merge], yhits[!merge],
      x[-queryHits(hits)], y[-subjectHits(hits)])
}

Note that this drops all metadata columns. But it really seems like reduce() could gain some option for minimum absolute or relative overlap.

ADD COMMENT • link 10.6 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks Michael!

So the only way to merge overlap ranges from my 4 (or more) granges, is to perform all the pairwise comparison?
My problem is: if I have a range a in A that overlap a range b in B (with more than 60%), and a also overlap (>60%) a range c in C, I want then to merge then all into only 1 ranges... I guess I should compare A and B to get a new granges A', compare A' to C... and so forth. I guess this way there is no chance that I fuse a range c with a range a', when there would not have been a a or b with which I would have merged c?
(of course I can merge c with a' when c overlapping b but not a, but I want that to happen (I think))

ADD REPLY • link 10.6 years ago samuel collombet ▴ 140

0

Entering edit mode

It might help if you describe your goal. What are you going to use these merged ranges for? My initial proposal was a progressive reduction, i.e., requiring overlap with the result of the previous merge. An alternative would be to require that each range have 60% overlap with every other range in its "cluster".

ADD REPLY • link 10.6 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Basically I want to make a unique annotation of enriched regions, based on multiple ChIPseq experiment (I get enriched regions for each regions using macs for peak calling).
For now I just merge any overlapping regions from any experiments ; However, I have found that some times, I can find 2 peaks in 2 conditions that are close, overlap a little bit, but for which I know they are distinct functional entity. So I would like to merge those that only have a relatively large overlap.

About the "that each range have 60% overlap with every other range in its "cluster"", this also give artifact, at sometimes a very small peak is called in one condition, but a much bigger one in another ; the small one can be entirely covered by the big one, but he only represented a small fraction of the second ; so they would not be merge.

I think the progressive reduction should work fine :)

ADD REPLY • link 10.6 years ago samuel collombet ▴ 140

score 2 · Answer 2 · 2015-05-28

Hi Samuel,

I guess this is not quite as high level as what Michael was hoping for (that is, the extra option to reduce() for controlling minimum relative overlap) but it gets us close. On the other hand it has the advantage to be more flexible by letting the user specify an arbitrary clustering (i.e. which ranges must be merged) via a Hits object:

library(GenomicRanges)

## Extract clusters from Hits object.
extractClustersFromSelfHits <- function(hits)
{
    stopifnot(is(hits, "Hits"))
    stopifnot(queryLength(hits) == subjectLength(hits))
    hits <- union(hits, t(hits))
    qh <- queryHits(hits)
    sh <- subjectHits(hits)
    cid <- seq_len(queryLength(hits))  # cluster ids
    while (TRUE) {
        h <- Hits(qh, cid[sh],
                  queryLength(hits), subjectLength(hits))
        cid2 <- pmin(cid, selectHits(h, "first"))
        if (identical(cid2, cid))
            break
        cid <- cid2
    }
    unname(splitAsList(seq_len(queryLength(hits)), cid))
}

## Merge ranges that are "connected" (directly or indirectly)
## via a hit (or several hits) in 'hits'.
mergeConnectedRanges <- function(x, hits)
{
    stopifnot(is(x, "GenomicRanges"))
    stopifnot(is(hits, "Hits"))
    stopifnot(queryLength(hits) == subjectLength(hits))
    stopifnot(queryLength(hits) == length(x))
    clusters <- extractClustersFromSelfHits(hits)
    ans <- range(extractList(x, clusters))
    if (any(elementLengths(ans) != 1L))
        stop(wmsg("some connected ranges are not on the same ",
                  "chromosome and strand, and thus cannot be ",
                  "merged"))
    ans <- unlist(ans)
    mcols(ans)$revmap <- clusters
    ans
}

For your problem ("Merge multiple genomic intervals if overlapping more than xx percents"), it can be used as follow:

## Let's say the genomic intervals to merge are in GRanges object 'gr0'.
strand(gr0) <- "*"
hits <- findOverlaps(gr0)

## Subset 'hits' to keep only hits that achieve 60% overlap.
x <- gr0[queryHits(hits)]
y <- gr0[subjectHits(hits)]
relative_overlap <- width(pintersect(x, y)) / pmin(width(x), width(y))
hits <- hits[relative_overlap >= 0.6]

## Merge the ranges in 'gr0' that are connected via one or more hits in 'hits'.
gr1 <- mergeConnectedRanges(gr0, hits)
gr1

Note that the returned GRanges object has 1 element (i.e. 1 genomic range) per cluster and a metadata column revmap that maps each cluster to the corresponding elements in gr0. You can extract this metadata column with mcols(gr1)$revmap (it's an IntegerList).

Hope this helps,
H.