Select from Overlapping Genomic Ranges based on Score or other Metadata
Entering edit mode
IV ▴ 30
Last seen 2.6 years ago

I have a series of genomic ranges accompanied with metadata.

For example, the simplest case is:

Chromosome Start End Score

chr1 1234567 1234987 15

chr1 1234577 1234999 25

chr1 1234560 1234940 50

chr2 1234567 1234987 15

chr2 1234577 1234999 25

chr2 1234560 1234940 50

I would like to be able to select only one for each overlapping batch based on a condition on the metadata (e.g. from those that overlap keep the one with the max score, etc).

In the above example, the dataset should be reduced to:

chr1 1234560 1234940 50

chr2 1234560 1234940 50

.. since the first 3 overlap and the third is the one with the highest score, while the last three also overlap and the 6th has the highest score.

Any ideas?


genomicranges • 812 views
Entering edit mode
Last seen 19 days ago
United States

This is yet another good example of the need for some sort of connected-component-based reduction of cluters of overlapping ranges. The basic idea is to find the self-overlaps, form a graph from the hits, and find the connected components in the graph:

setAs("Hits", "graph", function(from) {
          stopifnot(queryLength(from) == subjectLength(from))
          f <- factor(queryHits(from), seq_len(queryLength(from)))
          NEL <- split(subjectHits(from), f)
          graph::graphNEL(names(NEL), NEL)

hits <- findOverlaps(gr)
g <- as(hits, "graph")
comps <- unname(RBGL::connectedComp(g))

Once you have the components, then aggregation is fairly straightforward. For example,

scoreByComp <- extractList(score(gr), comps)
sel <- phead(which(scoreComp == max(scoreByComp)), 1)
unlist(relist(gr, scoreByComp)[sel])

Using which.max() would be more convenient there but it is not implemented efficiently for compressed lists. Will change that in devel today. Then it would be something like:

m <- which.max(scoreByComp)
unlist(relist(gr, scoreByComp)[as(m, "IntegerList")])

Perhaps a more straight-forward version of the last line:

gr[m + start(PartitioningByEnd(scoreByComp)) - 1L]

Is there some way we can make this easier for users? Both the clustering and subscripting with the which.max() result?



Login before adding your answer.

Traffic: 313 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6