find overlapping regions

M.Boetzer@lumc.nl ▴ 20

@mboetzerlumcnl-2807

Last seen 9.6 years ago

Dear list,

I have a single region with a start and an end, where start < end. I want to find regions that have an overlap of more than 50% with that region. The regions to compare with are in a data frame with start and end positions:

    start = 133375983
    end = 146245512

    data = data.frame(c(133470532, 133966699, 134162735, 134236863, 146225580), c(133754071, 133969713, 134163857, 134249655, 156245512))
    colnames(data) = c("start2", "end2")

    > data
         start2      end2
    1 133470532 133754071
    2 133966699 133969713
    3 134162735 134163857
    4 134236863 134249655
    5 146225580 156245512

I've already written some code that does the trick; however, when the size of reg1 becomes very large, it really slows down:

    regfound = c()
    reg1 = seq(start, end, 1)
    for(i in 1:nrow(data)){
      eq_reg = sum(is.element(seq(data$start2[i], data$end2[i], 1), reg1)==T)
      if(eq_reg!=0)
        regfound = c(regfound, round(eq_reg/((data$end2[i]-data$start2[i])+1)*100,1))
      else
        regfound = c(regfound,F)
    }

    > regfound
    [1] 100.0 100.0 100.0 100.0   0.2

Does anyone know a faster or more elegant way of doing this?

Thanks in advance,
Marten

ADD COMMENT • link 15.9 years ago M.Boetzer@lumc.nl ▴ 20

Martin Morgan 25k

@martin-morgan-1513

Last seen 3 days ago

United States

Hi Marten --

Probably the key is to simplify how the overlapping region is found, and then to vectorize the calculation. Maybe something along the lines of

    > width <- data$end2 - data$start2
    > olap <- (pmin(end, data$end2) - pmax(start, data$start2)) / width
    > olap > .5
    [1]  TRUE  TRUE  TRUE  TRUE FALSE

?

Martin

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
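For anyone who wants the vectorized idea as a reusable piece, here is a minimal sketch. The helper name `overlap_pct` is made up for illustration and is not part of any package. It assumes inclusive integer coordinates, as in the original loop (hence the `+ 1`), and clamps disjoint regions to 0 so they do not produce negative percentages.

```r
# Query region and candidate regions from the question (inclusive coordinates).
start <- 133375983
end   <- 146245512
data  <- data.frame(
  start2 = c(133470532, 133966699, 134162735, 134236863, 146225580),
  end2   = c(133754071, 133969713, 134163857, 134249655, 156245512)
)

# Hypothetical helper: percentage of each candidate region that lies
# inside the query region [start, end].
overlap_pct <- function(start, end, start2, end2) {
  # Overlap length in positions; pmax(..., 0) maps disjoint regions to 0.
  olap_len <- pmax(pmin(end, end2) - pmax(start, start2) + 1, 0)
  width <- end2 - start2 + 1
  round(100 * olap_len / width, 1)
}

pct <- overlap_pct(start, end, data$start2, data$end2)
pct               # values: 100 100 100 100 0.2, same as regfound in the question
data[pct > 50, ]  # rows 1 to 4: regions with more than 50% overlap
```

Because only the region endpoints are compared, the cost no longer depends on how long the regions are, which is what made the `seq()`-based loop slow.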

ADD COMMENT • link 15.9 years ago Martin Morgan 25k

Joern Toedling ▴ 730

@joern-toedling-1244

Last seen 9.6 years ago

Hi Marten,

you may want to have a look at the function regionOverlap in package Ringo, which is not elegant but probably faster since it uses (simple) C code for computing the overlap.

Regards,
Joern
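For readers who do not have Ringo installed, the kind of quantity involved, the overlap between every region in one set and every region in another, can be sketched in base R. This is only an illustration under stated assumptions: `pairwise_overlap` is a hypothetical helper, not Ringo's interface or implementation, the toy coordinates are invented, and coordinates are treated as inclusive.

```r
# Hypothetical helper (not Ringo's API): overlap length between every
# region in x (rows) and every region in y (columns), inclusive coordinates.
pairwise_overlap <- function(x_start, x_end, y_start, y_end) {
  lo <- outer(x_start, y_start, pmax)  # later of the two starts
  hi <- outer(x_end, y_end, pmin)      # earlier of the two ends
  pmax(hi - lo + 1, 0)                 # disjoint pairs get 0
}

# Toy regions, invented for illustration.
x <- data.frame(start = c(1, 20), end = c(10, 30))
y <- data.frame(start = c(5, 25, 40), end = c(12, 26, 50))

ov <- pairwise_overlap(x$start, x$end, y$start, y$end)

# Fraction of each y region (column) covered by each x region (row).
frac <- sweep(ov, 2, y$end - y$start + 1, "/")
which(frac > 0.5, arr.ind = TRUE)  # (row, col) pairs with more than 50% overlap
```

Note that `outer()` builds the full x-by-y matrix, so this sketch only suits modest numbers of regions; compiled code is the better choice for large inputs.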

ADD COMMENT • link 15.9 years ago Joern Toedling ▴ 730

M.Boetzer@lumc.nl ▴ 20

@mboetzerlumcnl-2807

Last seen 9.6 years ago

Hi Joern,

thank you for this function, it works exactly as I want it to!

Cheers,
Marten

-----Original message-----
From: J.M. Toedling on behalf of Joern Toedling
Sent: Tue 20-5-2008 17:45
To: Boetzer, M. (HG)
CC: bioconductor@stat.math.ethz.ch
Subject: Re: [BioC] find overlapping regions

ADD COMMENT • link 15.9 years ago M.Boetzer@lumc.nl ▴ 20
