Select from Overlapping Genomic Ranges based on Score or other Metadata
1
1
Entering edit mode
IV ▴ 30
@iv-7340
Last seen 5.1 years ago
USA

I have a series of genomic ranges accompanied with metadata.

For example, the simplest case is:

Chromosome Start End Score

chr1 1234567 1234987 15

chr1 1234577 1234999 25

chr1 1234560 1234940 50

chr2 1234567 1234987 15

chr2 1234577 1234999 25

chr2 1234560 1234940 50

I would like to be able to select only one for each overlapping batch based on a condition on the metadata (e.g. from those that overlap keep the one with the max score, etc).

In the above example, the dataset should be reduced to:

chr1 1234560 1234940 50

chr2 1234560 1234940 50

.. since the first 3 overlap and the third is the one with the highest score, while the last three also overlap and the 6th has the highest score.

Any ideas?

 

genomicranges • 1.2k views
ADD COMMENT
3
Entering edit mode
@michael-lawrence-3846
Last seen 2.4 years ago
United States

This is yet another good example of the need for some sort of connected-component-based reduction of cluters of overlapping ranges. The basic idea is to find the self-overlaps, form a graph from the hits, and find the connected components in the graph:

setAs("Hits", "graph", function(from) {
          stopifnot(queryLength(from) == subjectLength(from))
          f <- factor(queryHits(from), seq_len(queryLength(from)))
          NEL <- split(subjectHits(from), f)
          graph::graphNEL(names(NEL), NEL)
      })

hits <- findOverlaps(gr)
g <- as(hits, "graph")
comps <- unname(RBGL::connectedComp(g))

Once you have the components, then aggregation is fairly straightforward. For example,

scoreByComp <- extractList(score(gr), comps)
sel <- phead(which(scoreComp == max(scoreByComp)), 1)
unlist(relist(gr, scoreByComp)[sel])

Using which.max() would be more convenient there but it is not implemented efficiently for compressed lists. Will change that in devel today. Then it would be something like:

m <- which.max(scoreByComp)
unlist(relist(gr, scoreByComp)[as(m, "IntegerList")])

Perhaps a more straight-forward version of the last line:

gr[m + start(PartitioningByEnd(scoreByComp)) - 1L]

Is there some way we can make this easier for users? Both the clustering and subscripting with the which.max() result?

 

ADD COMMENT

Login before adding your answer.

Traffic: 752 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6