Question: How to use GenomicRanges to find overlaps by percentage covered of interval
gravatar for khauser130d
4.1 years ago by
United Kingdom
khauser130d20 wrote:

I have two dataframes one with two (or more) individuals (denoted by sampleID) denoted by TEST and the other consisting of just 3 columns called REF.

I want to add REF onto TEST based upon the overlaps between the "chr","start","end" columns of REFand the corresponding (in order) columns of TEST"Chromosome","Start","End".

I want the definition of overlap to be that in order for a REF interval to overlap a TESTinterval, it must cover >50% of TEST interval. 

I've been playing around with GenomicRanges. However, the findOverlap function only allows me to specify the minOverlap as the number of overlapping positions rather than a fraction of the interval covered. So, using GenomicRanges, how can I do this?

Some example input:

REF = structure(list(chr = c("1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"), 
    start = c(766101L, 1627918L, 4421060L, 7324468L, 8053732L, 
    8182462L, 8182584L, 8182584L, 8206130L, 8804237L, 10369546L, 
    10370541L, 10543836L, 10656324L, 12354307L, 12841928L, 12845863L, 
    12909237L, 12909965L, 13444908L), end = c(809773L, 1672603L, 
    4424115L, 7325408L, 8067990L, 8189854L, 8189285L, 8189285L, 
    8209321L, 8812660L, 10377983L, 10377983L, 10545046L, 10657912L, 
    12357076L, 12971833L, 12883096L, 12927107L, 12918079L, 13468022L
    ), Deft = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1)), .Names = c("chr", "start", "end", "Deft"
), row.names = c(217L, 568L, 1340L, 1691L, 1804L, 1811L, 1812L, 
1813L, 1819L, 1880L, 2017L, 2020L, 2041L, 2049L, 2224L, 2282L, 
2284L, 2332L, 2335L, 2424L), class = "data.frame")

TEST = structure(list(sampleID = c("SID1331", "SID1331", "SID1331", 
"SID1331", "SID1331", "SID1331", "SID1331", "SID1331", "SID1331", 
"SID1331", "SID1337", "SID1337", "SID1337", "SID1337", "SID1337", 
"SID1337", "SID1337", "SID1337", "SID1337", "SID1337", "SID1337"
), Chromosome = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 10L, 
10L, 11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L), Start = c(61735L, 
7541291L, 32664866L, 45935056L, 55449273L, 117769301L, 117892183L, 
167580307L, 172458743L, 172852403L, 129495492L, 129526092L, 198572L, 
112837740L, 112847206L, 132105712L, 132126171L, 150442L, 23261653L, 
23270278L, 28135989L), End = c(7539746L, 32664843L, 45934562L, 
55445562L, 117767918L, 117886211L, 167580003L, 172457651L, 172851634L, 
214938359L, 129525791L, 135506704L, 112837593L, 112846803L, 132104437L, 
132122974L, 134944770L, 23261612L, 23268885L, 28131521L, 52920414L
)), .Names = c("sampleID", "Chromosome", "Start", "End"), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1000L, 1001L, 1002L, 1003L, 
1004L, 1005L, 1006L, 1007L, 1008L, 1009L, 1010L), class = "data.frame")
genomicranges • 5.4k views
ADD COMMENTlink modified 4.1 years ago by Michael Lawrence11k • written 4.1 years ago by khauser130d20
Answer: How to use GenomicRanges to find overlaps by percentage covered of interval
gravatar for Michael Lawrence
4.1 years ago by
United States
Michael Lawrence11k wrote:


Make your GRanges:

refGR <- makeGRangesFromDataFrame(REF)
testGR <- makeGRangesFromDataFrame(TEST)

Find overlaps:

hits <- findOverlaps(refGR, testGR)

Compute percent overlap and filter the hits:

overlaps <- pintersect(refGR[queryHits(hits)], testGR[subjectHits(hits)])
percentOverlap <- width(overlaps) / width(testGR[subjectHits(hits)])
hits <- hits[percentOverlap > 0.5]

Note that none of the sample data passes the filter. It is not clear what you mean by adding REF onto TEST. One example is merging the "Deft" column from REF into TEST:

TEST$Deft <- NA
TEST$Deft[subjectHits(hits)] <- REF$Deft[queryHits(hits)]


ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by Michael Lawrence11k

Would it be possible to post pseudocode? I'm having trouble knowing how to aligh/join the ranges in a way that allows me to use pintersect afterwards

ADD REPLYlink written 4.1 years ago by khauser130d20

I edited the answer to give more detail.

ADD REPLYlink written 4.1 years ago by Michael Lawrence11k

Perfect!, thank you!

ADD REPLYlink written 8 months ago by eegonzalezk0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 88 users visited in the last hour