Search
Question: Select from Overlapping Genomic Ranges based on Score or other Metadata
1
gravatar for IV
21 months ago by
IV20
USA
IV20 wrote:

I have a series of genomic ranges accompanied with metadata.

For example, the simplest case is:

Chromosome Start End Score

chr1 1234567 1234987 15

chr1 1234577 1234999 25

chr1 1234560 1234940 50

chr2 1234567 1234987 15

chr2 1234577 1234999 25

chr2 1234560 1234940 50

I would like to be able to select only one for each overlapping batch based on a condition on the metadata (e.g. from those that overlap keep the one with the max score, etc).

In the above example, the dataset should be reduced to:

chr1 1234560 1234940 50

chr2 1234560 1234940 50

.. since the first 3 overlap and the third is the one with the highest score, while the last three also overlap and the 6th has the highest score.

Any ideas?

 

ADD COMMENTlink modified 21 months ago by Michael Lawrence9.8k • written 21 months ago by IV20
3
gravatar for Michael Lawrence
21 months ago by
United States
Michael Lawrence9.8k wrote:

This is yet another good example of the need for some sort of connected-component-based reduction of cluters of overlapping ranges. The basic idea is to find the self-overlaps, form a graph from the hits, and find the connected components in the graph:

setAs("Hits", "graph", function(from) {
          stopifnot(queryLength(from) == subjectLength(from))
          f <- factor(queryHits(from), seq_len(queryLength(from)))
          NEL <- split(subjectHits(from), f)
          graph::graphNEL(names(NEL), NEL)
      })

hits <- findOverlaps(gr)
g <- as(hits, "graph")
comps <- unname(RBGL::connectedComp(g))

Once you have the components, then aggregation is fairly straightforward. For example,

scoreByComp <- extractList(score(gr), comps)
sel <- phead(which(scoreComp == max(scoreByComp)), 1)
unlist(relist(gr, scoreByComp)[sel])

Using which.max() would be more convenient there but it is not implemented efficiently for compressed lists. Will change that in devel today. Then it would be something like:

m <- which.max(scoreByComp)
unlist(relist(gr, scoreByComp)[as(m, "IntegerList")])

Perhaps a more straight-forward version of the last line:

gr[m + start(PartitioningByEnd(scoreByComp)) - 1L]

Is there some way we can make this easier for users? Both the clustering and subscripting with the which.max() result?

 

ADD COMMENTlink written 21 months ago by Michael Lawrence9.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 403 users visited in the last hour