Search
Question: Select from Overlapping Genomic Ranges based on Score or other Metadata
1
2.7 years ago by
IV20
USA
IV20 wrote:

I have a series of genomic ranges accompanied with metadata.

For example, the simplest case is:

Chromosome Start End Score

chr1 1234567 1234987 15

chr1 1234577 1234999 25

chr1 1234560 1234940 50

chr2 1234567 1234987 15

chr2 1234577 1234999 25

chr2 1234560 1234940 50

I would like to be able to select only one for each overlapping batch based on a condition on the metadata (e.g. from those that overlap keep the one with the max score, etc).

In the above example, the dataset should be reduced to:

chr1 1234560 1234940 50

chr2 1234560 1234940 50

.. since the first 3 overlap and the third is the one with the highest score, while the last three also overlap and the 6th has the highest score.

Any ideas?

ADD COMMENTlink
modified 2.6 years ago by Michael Lawrence10k • written 2.7 years ago by IV20
3
2.6 years ago by
United States
Michael Lawrence10k wrote:

This is yet another good example of the need for some sort of connected-component-based reduction of cluters of overlapping ranges. The basic idea is to find the self-overlaps, form a graph from the hits, and find the connected components in the graph:

setAs("Hits", "graph", function(from) {
stopifnot(queryLength(from) == subjectLength(from))
f <- factor(queryHits(from), seq_len(queryLength(from)))
NEL <- split(subjectHits(from), f)
graph::graphNEL(names(NEL), NEL)
})

hits <- findOverlaps(gr)
g <- as(hits, "graph")
comps <- unname(RBGL::connectedComp(g))


Once you have the components, then aggregation is fairly straightforward. For example,

scoreByComp <- extractList(score(gr), comps)
sel <- phead(which(scoreComp == max(scoreByComp)), 1)
unlist(relist(gr, scoreByComp)[sel])

Using which.max() would be more convenient there but it is not implemented efficiently for compressed lists. Will change that in devel today. Then it would be something like:

m <- which.max(scoreByComp)
unlist(relist(gr, scoreByComp)[as(m, "IntegerList")])

Perhaps a more straight-forward version of the last line:

gr[m + start(PartitioningByEnd(scoreByComp)) - 1L]

Is there some way we can make this easier for users? Both the clustering and subscripting with the which.max() result?

ADD COMMENTlink written 2.6 years ago by Michael Lawrence10k
Please log in to add an answer.

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 273 users visited in the last hour