Hi Samuel,
I guess this is not quite as high level as what Michael was hoping for (that is, the extra option to reduce()
for controlling minimum relative overlap) but it gets us close. On the other hand it has the advantage to be more flexible by letting the user specify an arbitrary clustering (i.e. which ranges must be merged) via a Hits object:
library(GenomicRanges)
## Extract clusters from Hits object.
extractClustersFromSelfHits <- function(hits)
{
stopifnot(is(hits, "Hits"))
stopifnot(queryLength(hits) == subjectLength(hits))
hits <- union(hits, t(hits))
qh <- queryHits(hits)
sh <- subjectHits(hits)
cid <- seq_len(queryLength(hits)) # cluster ids
while (TRUE) {
h <- Hits(qh, cid[sh],
queryLength(hits), subjectLength(hits))
cid2 <- pmin(cid, selectHits(h, "first"))
if (identical(cid2, cid))
break
cid <- cid2
}
unname(splitAsList(seq_len(queryLength(hits)), cid))
}
## Merge ranges that are "connected" (directly or indirectly)
## via a hit (or several hits) in 'hits'.
mergeConnectedRanges <- function(x, hits)
{
stopifnot(is(x, "GenomicRanges"))
stopifnot(is(hits, "Hits"))
stopifnot(queryLength(hits) == subjectLength(hits))
stopifnot(queryLength(hits) == length(x))
clusters <- extractClustersFromSelfHits(hits)
ans <- range(extractList(x, clusters))
if (any(elementLengths(ans) != 1L))
stop(wmsg("some connected ranges are not on the same ",
"chromosome and strand, and thus cannot be ",
"merged"))
ans <- unlist(ans)
mcols(ans)$revmap <- clusters
ans
}
For your problem ("Merge multiple genomic intervals if overlapping more than xx percents"), it can be used as follow:
## Let's say the genomic intervals to merge are in GRanges object 'gr0'.
strand(gr0) <- "*"
hits <- findOverlaps(gr0)
## Subset 'hits' to keep only hits that achieve 60% overlap.
x <- gr0[queryHits(hits)]
y <- gr0[subjectHits(hits)]
relative_overlap <- width(pintersect(x, y)) / pmin(width(x), width(y))
hits <- hits[relative_overlap >= 0.6]
## Merge the ranges in 'gr0' that are connected via one or more hits in 'hits'.
gr1 <- mergeConnectedRanges(gr0, hits)
gr1
Note that the returned GRanges object has 1 element (i.e. 1 genomic range) per cluster and a metadata column revmap
that maps each cluster to the corresponding elements in gr0
. You can extract this metadata column with mcols(gr1)$revmap
(it's an IntegerList).
Hope this helps,
H.
Thanks Michael!
So the only way to merge overlap ranges from my 4 (or more) granges, is to perform all the pairwise comparison?
My problem is: if I have a range a in A that overlap a range b in B (with more than 60%), and a also overlap (>60%) a range c in C, I want then to merge then all into only 1 ranges... I guess I should compare A and B to get a new granges A', compare A' to C... and so forth. I guess this way there is no chance that I fuse a range c with a range a', when there would not have been a a or b with which I would have merged c?
(of course I can merge c with a' when c overlapping b but not a, but I want that to happen (I think))
It might help if you describe your goal. What are you going to use these merged ranges for? My initial proposal was a progressive reduction, i.e., requiring overlap with the result of the previous merge. An alternative would be to require that each range have 60% overlap with every other range in its "cluster".
Basically I want to make a unique annotation of enriched regions, based on multiple ChIPseq experiment (I get enriched regions for each regions using macs for peak calling).
For now I just merge any overlapping regions from any experiments ; However, I have found that some times, I can find 2 peaks in 2 conditions that are close, overlap a little bit, but for which I know they are distinct functional entity. So I would like to merge those that only have a relatively large overlap.
About the "that each range have 60% overlap with every other range in its "cluster"", this also give artifact, at sometimes a very small peak is called in one condition, but a much bigger one in another ; the small one can be entirely covered by the big one, but he only represented a small fraction of the second ; so they would not be merge.
I think the progressive reduction should work fine :)