How to use GenomicRanges to find overlaps by percentage covered of interval
Entering edit mode
khauser130d ▴ 50
Last seen 8.6 years ago
United Kingdom

I have two dataframes one with two (or more) individuals (denoted by sampleID) denoted by TEST and the other consisting of just 3 columns called REF.

I want to add REF onto TEST based upon the overlaps between the "chr","start","end" columns of REFand the corresponding (in order) columns of TEST"Chromosome","Start","End".

I want the definition of overlap to be that in order for a REF interval to overlap a TESTinterval, it must cover >50% of TEST interval. 

I've been playing around with GenomicRanges. However, the findOverlap function only allows me to specify the minOverlap as the number of overlapping positions rather than a fraction of the interval covered. So, using GenomicRanges, how can I do this?

Some example input:

REF = structure(list(chr = c("1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"), 
    start = c(766101L, 1627918L, 4421060L, 7324468L, 8053732L, 
    8182462L, 8182584L, 8182584L, 8206130L, 8804237L, 10369546L, 
    10370541L, 10543836L, 10656324L, 12354307L, 12841928L, 12845863L, 
    12909237L, 12909965L, 13444908L), end = c(809773L, 1672603L, 
    4424115L, 7325408L, 8067990L, 8189854L, 8189285L, 8189285L, 
    8209321L, 8812660L, 10377983L, 10377983L, 10545046L, 10657912L, 
    12357076L, 12971833L, 12883096L, 12927107L, 12918079L, 13468022L
    ), Deft = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1)), .Names = c("chr", "start", "end", "Deft"
), row.names = c(217L, 568L, 1340L, 1691L, 1804L, 1811L, 1812L, 
1813L, 1819L, 1880L, 2017L, 2020L, 2041L, 2049L, 2224L, 2282L, 
2284L, 2332L, 2335L, 2424L), class = "data.frame")

TEST = structure(list(sampleID = c("SID1331", "SID1331", "SID1331", 
"SID1331", "SID1331", "SID1331", "SID1331", "SID1331", "SID1331", 
"SID1331", "SID1337", "SID1337", "SID1337", "SID1337", "SID1337", 
"SID1337", "SID1337", "SID1337", "SID1337", "SID1337", "SID1337"
), Chromosome = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 10L, 
10L, 11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L), Start = c(61735L, 
7541291L, 32664866L, 45935056L, 55449273L, 117769301L, 117892183L, 
167580307L, 172458743L, 172852403L, 129495492L, 129526092L, 198572L, 
112837740L, 112847206L, 132105712L, 132126171L, 150442L, 23261653L, 
23270278L, 28135989L), End = c(7539746L, 32664843L, 45934562L, 
55445562L, 117767918L, 117886211L, 167580003L, 172457651L, 172851634L, 
214938359L, 129525791L, 135506704L, 112837593L, 112846803L, 132104437L, 
132122974L, 134944770L, 23261612L, 23268885L, 28131521L, 52920414L
)), .Names = c("sampleID", "Chromosome", "Start", "End"), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1000L, 1001L, 1002L, 1003L, 
1004L, 1005L, 1006L, 1007L, 1008L, 1009L, 1010L), class = "data.frame")
genomicranges • 11k views
Entering edit mode
Last seen 2.6 years ago
United States


Make your GRanges:

refGR <- makeGRangesFromDataFrame(REF)
testGR <- makeGRangesFromDataFrame(TEST)

Find overlaps:

hits <- findOverlaps(refGR, testGR)

Compute percent overlap and filter the hits:

overlaps <- pintersect(refGR[queryHits(hits)], testGR[subjectHits(hits)])
percentOverlap <- width(overlaps) / width(testGR[subjectHits(hits)])
hits <- hits[percentOverlap > 0.5]

Note that none of the sample data passes the filter. It is not clear what you mean by adding REF onto TEST. One example is merging the "Deft" column from REF into TEST:

TEST$Deft <- NA
TEST$Deft[subjectHits(hits)] <- REF$Deft[queryHits(hits)]


Entering edit mode

Would it be possible to post pseudocode? I'm having trouble knowing how to aligh/join the ranges in a way that allows me to use pintersect afterwards

Entering edit mode

I edited the answer to give more detail.

Entering edit mode

Perfect!, thank you!


Login before adding your answer.

Traffic: 582 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6