find overlapping regions
@mboetzerlumcnl-2807
Dear list,

I have a single region with a start and an end, where start < end. I want to find regions that overlap that region by more than 50%. The regions to compare against are in a data frame of start and end positions:

    start = 133375983
    end = 146245512

    data = data.frame(c(133470532, 133966699, 134162735, 134236863, 146225580), c(133754071, 133969713, 134163857, 134249655, 156245512))
    colnames(data) = c("start2", "end2")

    > data
         start2      end2
    1 133470532 133754071
    2 133966699 133969713
    3 134162735 134163857
    4 134236863 134249655
    5 146225580 156245512

I've already written some code that does the trick; however, when reg1 becomes very large, it slows down considerably:

    regfound = c()
    reg1 = seq(start, end, 1)
    for(i in 1:nrow(data)){
      eq_reg = sum(is.element(seq(data$start2[i], data$end2[i], 1), reg1)==T)
      if(eq_reg!=0)
        regfound = c(regfound, round(eq_reg/((data$end2[i]-data$start2[i])+1)*100,1))
      else
        regfound = c(regfound,F)
    }

    > regfound
    [1] 100.0 100.0 100.0 100.0   0.2

Does anyone know a faster or more elegant way of doing this?

Thanks in advance,
Marten
@martin-morgan-1513
Hi Marten --

Probably the key is to simplify how the overlapping region is found, and then to vectorize the calculation. Maybe something along the lines of

    > width <- data$end2 - data$start2
    > olap <- (pmin(end, data$end2) - pmax(start, data$start2)) / width
    > olap > .5
    [1]  TRUE  TRUE  TRUE  TRUE FALSE

?

Martin

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
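For anyone who wants the vectorized idea to reproduce the percentages from the original loop exactly, here is a minimal sketch. It makes two assumptions that go beyond the suggestion above: positions are treated as inclusive (hence the `+ 1`, as in the question's own code), and regions that do not overlap at all are clamped to 0 rather than yielding a negative value. The helper name `overlap_pct` is made up for this illustration.

```r
# Query region (closed interval, both ends included), taken from the question.
start <- 133375983
end   <- 146245512

data <- data.frame(
  start2 = c(133470532, 133966699, 134162735, 134236863, 146225580),
  end2   = c(133754071, 133969713, 134163857, 134249655, 156245512)
)

# Hypothetical helper (name is ours, not from the thread): percentage of each
# [start2, end2] region that is covered by the query region [start, end].
overlap_pct <- function(start, end, start2, end2) {
  width <- end2 - start2 + 1                          # inclusive width
  olap  <- pmin(end, end2) - pmax(start, start2) + 1  # inclusive overlap
  olap  <- pmax(olap, 0)                              # disjoint regions give 0
  100 * olap / width
}

pct <- overlap_pct(start, end, data$start2, data$end2)
round(pct, 1)
# [1] 100.0 100.0 100.0 100.0   0.2
pct > 50
# [1]  TRUE  TRUE  TRUE  TRUE FALSE
```

Because only the interval endpoints are compared, the cost depends on the number of rows in `data` and not on the length of the query region, which is what made the `seq()` / `is.element()` version slow.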
@joern-toedling-1244
Hi Marten,

you may want to have a look at the function regionOverlap in package Ringo, which is not elegant but probably faster since it uses (simple) C code for computing the overlap.

Regards,
Joern
@mboetzerlumcnl-2807
Hi Joern,

thank you for this function, it works exactly as I want it to!

Cheers,
Marten

-----Original message-----
From: J.M. Toedling on behalf of Joern Toedling
Sent: Tue 20-5-2008 17:45
To: Boetzer, M. (HG)
CC: bioconductor@stat.math.ethz.ch
Subject: Re: [BioC] find overlapping regions