How to use GenomicRanges to find overlaps by percentage covered of interval
1
2
Entering edit mode
khauser130d ▴ 30
@khauser130d-8880
Last seen 5.6 years ago
United Kingdom

I have two dataframes one with two (or more) individuals (denoted by sampleID) denoted by TEST and the other consisting of just 3 columns called REF.

I want to add REF onto TEST based upon the overlaps between the "chr","start","end" columns of REFand the corresponding (in order) columns of TEST"Chromosome","Start","End".

I want the definition of overlap to be that in order for a REF interval to overlap a TESTinterval, it must cover >50% of TEST interval. 

I've been playing around with GenomicRanges. However, the findOverlap function only allows me to specify the minOverlap as the number of overlapping positions rather than a fraction of the interval covered. So, using GenomicRanges, how can I do this?

Some example input:

REF = structure(list(chr = c("1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"), 
    start = c(766101L, 1627918L, 4421060L, 7324468L, 8053732L, 
    8182462L, 8182584L, 8182584L, 8206130L, 8804237L, 10369546L, 
    10370541L, 10543836L, 10656324L, 12354307L, 12841928L, 12845863L, 
    12909237L, 12909965L, 13444908L), end = c(809773L, 1672603L, 
    4424115L, 7325408L, 8067990L, 8189854L, 8189285L, 8189285L, 
    8209321L, 8812660L, 10377983L, 10377983L, 10545046L, 10657912L, 
    12357076L, 12971833L, 12883096L, 12927107L, 12918079L, 13468022L
    ), Deft = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1)), .Names = c("chr", "start", "end", "Deft"
), row.names = c(217L, 568L, 1340L, 1691L, 1804L, 1811L, 1812L, 
1813L, 1819L, 1880L, 2017L, 2020L, 2041L, 2049L, 2224L, 2282L, 
2284L, 2332L, 2335L, 2424L), class = "data.frame")

TEST = structure(list(sampleID = c("SID1331", "SID1331", "SID1331", 
"SID1331", "SID1331", "SID1331", "SID1331", "SID1331", "SID1331", 
"SID1331", "SID1337", "SID1337", "SID1337", "SID1337", "SID1337", 
"SID1337", "SID1337", "SID1337", "SID1337", "SID1337", "SID1337"
), Chromosome = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 10L, 
10L, 11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L), Start = c(61735L, 
7541291L, 32664866L, 45935056L, 55449273L, 117769301L, 117892183L, 
167580307L, 172458743L, 172852403L, 129495492L, 129526092L, 198572L, 
112837740L, 112847206L, 132105712L, 132126171L, 150442L, 23261653L, 
23270278L, 28135989L), End = c(7539746L, 32664843L, 45934562L, 
55445562L, 117767918L, 117886211L, 167580003L, 172457651L, 172851634L, 
214938359L, 129525791L, 135506704L, 112837593L, 112846803L, 132104437L, 
132122974L, 134944770L, 23261612L, 23268885L, 28131521L, 52920414L
)), .Names = c("sampleID", "Chromosome", "Start", "End"), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1000L, 1001L, 1002L, 1003L, 
1004L, 1005L, 1006L, 1007L, 1008L, 1009L, 1010L), class = "data.frame")
genomicranges • 7.5k views
ADD COMMENT
7
Entering edit mode
@michael-lawrence-3846
Last seen 1 day ago
United States

 

Make your GRanges:

refGR <- makeGRangesFromDataFrame(REF)
testGR <- makeGRangesFromDataFrame(TEST)

Find overlaps:

hits <- findOverlaps(refGR, testGR)

Compute percent overlap and filter the hits:

overlaps <- pintersect(refGR[queryHits(hits)], testGR[subjectHits(hits)])
percentOverlap <- width(overlaps) / width(testGR[subjectHits(hits)])
hits <- hits[percentOverlap > 0.5]

Note that none of the sample data passes the filter. It is not clear what you mean by adding REF onto TEST. One example is merging the "Deft" column from REF into TEST:

TEST$Deft <- NA
TEST$Deft[subjectHits(hits)] <- REF$Deft[queryHits(hits)]

 

ADD COMMENT
1
Entering edit mode

Would it be possible to post pseudocode? I'm having trouble knowing how to aligh/join the ranges in a way that allows me to use pintersect afterwards

ADD REPLY
1
Entering edit mode

I edited the answer to give more detail.

ADD REPLY
0
Entering edit mode

Perfect!, thank you!

ADD REPLY

Login before adding your answer.

Traffic: 481 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6