Question

Select from Overlapping Genomic Ranges based on Score or other Metadata

1

Entering edit mode

IV ▴ 30

@iv-7340

Last seen 5.7 years ago

USA

I have a series of genomic ranges accompanied with metadata.

For example, the simplest case is:

Chromosome Start End Score

chr1 1234567 1234987 15

chr1 1234577 1234999 25

chr1 1234560 1234940 50

chr2 1234567 1234987 15

chr2 1234577 1234999 25

chr2 1234560 1234940 50

I would like to be able to select only one for each overlapping batch based on a condition on the metadata (e.g. from those that overlap keep the one with the max score, etc).

In the above example, the dataset should be reduced to:

chr1 1234560 1234940 50

chr2 1234560 1234940 50

.. since the first 3 overlap and the third is the one with the highest score, while the last three also overlap and the 6th has the highest score.

Any ideas?

genomicranges • 1.3k views

ADD COMMENT • link updated 8.8 years ago by Michael Lawrence ★ 11k • written 8.8 years ago by IV ▴ 30

score 3 · Accepted Answer · 2016-02-22

This is yet another good example of the need for some sort of connected-component-based reduction of cluters of overlapping ranges. The basic idea is to find the self-overlaps, form a graph from the hits, and find the connected components in the graph:

setAs("Hits", "graph", function(from) {
          stopifnot(queryLength(from) == subjectLength(from))
          f <- factor(queryHits(from), seq_len(queryLength(from)))
          NEL <- split(subjectHits(from), f)
          graph::graphNEL(names(NEL), NEL)
      })

hits <- findOverlaps(gr)
g <- as(hits, "graph")
comps <- unname(RBGL::connectedComp(g))

Once you have the components, then aggregation is fairly straightforward. For example,

scoreByComp <- extractList(score(gr), comps)
sel <- phead(which(scoreComp == max(scoreByComp)), 1)
unlist(relist(gr, scoreByComp)[sel])

Using which.max() would be more convenient there but it is not implemented efficiently for compressed lists. Will change that in devel today. Then it would be something like:

m <- which.max(scoreByComp)
unlist(relist(gr, scoreByComp)[as(m, "IntegerList")])

Perhaps a more straight-forward version of the last line:

gr[m + start(PartitioningByEnd(scoreByComp)) - 1L]

Is there some way we can make this easier for users? Both the clustering and subscripting with the which.max() result?