IRanges: Need help speeding up sliding window analysis
1
0
Entering edit mode
@michael-dondrup-3849
Last seen 9.6 years ago
Hi, I need some help with speeding up a sliding window analysis on an Rle object of length > 1 million. I am using functions 'successiveViews' with negative gap width and 'viewApply' vs. viewMeans. My goal is to apply a discrete differential operator that computes the difference between the 'left' half of the window and the 'right', aka. a cheap discrete numeric first order differentiation. What I found is: viewMeans(x) << viewApply(x, mean) << viewApply(x, diff.op) in terms of time, example below. Is there a way to pimp this code to make it work on the genome scale? I appreciate your input, I am confident there is a better way to do it. Thank you very much Michael Code example: diff.op <- function(x, lrprop=1/2) { len = length(x) i = ceiling(len*lrprop) (sum(x[i:len]) - sum(x[1:i])) / len } sliding.window.apply <- function(object, width, fun, ...) { x <- trim(successiveViews(subject=object, width=rep(width, ceiling(length(object)) ), gap=-width+1)) return (Rle(viewApply(x, fun, ...))) } sliding.window.mean <- function(object, width) { x <- trim(successiveViews(subject=object, width=rep(width, ceiling(length(object)) ), gap=-width+1)) return (viewMeans(x)) } rle <- Rle(1:10000) > system.time(sliding.window.mean(rle, 30)) user system elapsed 0.036 0.004 0.098 > system.time(sliding.window.apply(rle, 30, mean)) user system elapsed 4.380 0.065 6.010 > system.time(sliding.window.apply(rle, 30, diff.op)) user system elapsed 38.857 0.204 39.127 > sessionInfo() R version 2.11.1 (2010-05-31) x86_64-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] IRanges_1.6.6 loaded via a namespace (and not attached): [1] annotate_1.26.1 AnnotationDbi_1.10.2 Biobase_2.8.0 DBI_0.2-5 DESeq_1.0.4 [6] genefilter_1.30.0 geneplotter_1.26.0 grid_2.11.1 RColorBrewer_1.0-2 RSQLite_0.9-1 [11] splines_2.11.1 survival_2.35-8 xtable_1.5-6
• 1.7k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 5 days ago
United States
On 09/24/2010 03:11 AM, Michael Dondrup wrote: > Hi, > > I need some help with speeding up a sliding window analysis on an Rle object of length > 1 million. > I am using functions 'successiveViews' with negative gap width and 'viewApply' vs. viewMeans. > My goal is to apply a discrete differential operator that computes the difference between the 'left' > half of the window and the 'right', aka. a cheap discrete numeric first order differentiation. > > What I found is: viewMeans(x) << viewApply(x, mean) << viewApply(x, diff.op) in terms of time, example below. > Is there a way to pimp this code to make it work on the genome scale? I appreciate your input, I am confident there > is a better way to do it. maybe win <- 30 diff(cumsum(rle), win) / win for numeric (not integer) rle, though there might be rounding problems if cumsum gets large. A strategy might be to break the Rle into regions separated by islands of at least 'win' 0's (using runLength / runValue to identify candidate break points), which allows one to reset the cumsum. Some inspiration might come from http://www.mail-archive.com/r-help at r-project.org/msg75280.html Also the end points might need fiddling (e.g., by padding rle with 'win' trailing zeros, which is I think in effect what successiveViews does. Martin > > Thank you very much > Michael > > Code example: > > diff.op <- function(x, lrprop=1/2) { > len = length(x) > i = ceiling(len*lrprop) > (sum(x[i:len]) - sum(x[1:i])) / len > } > > sliding.window.apply <- function(object, width, fun, ...) { > x <- trim(successiveViews(subject=object, width=rep(width, ceiling(length(object)) ), gap=-width+1)) > return (Rle(viewApply(x, fun, ...))) > } > > sliding.window.mean <- function(object, width) { > x <- trim(successiveViews(subject=object, width=rep(width, ceiling(length(object)) ), gap=-width+1)) > return (viewMeans(x)) > } > > rle <- Rle(1:10000) > >> system.time(sliding.window.mean(rle, 30)) > user system elapsed > 0.036 0.004 0.098 >> system.time(sliding.window.apply(rle, 30, mean)) > user system elapsed > 4.380 0.065 6.010 >> system.time(sliding.window.apply(rle, 30, diff.op)) > user system elapsed > 38.857 0.204 39.127 > >> sessionInfo() > R version 2.11.1 (2010-05-31) > x86_64-apple-darwin9.8.0 > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] IRanges_1.6.6 > > loaded via a namespace (and not attached): > [1] annotate_1.26.1 AnnotationDbi_1.10.2 Biobase_2.8.0 DBI_0.2-5 DESeq_1.0.4 > [6] genefilter_1.30.0 geneplotter_1.26.0 grid_2.11.1 RColorBrewer_1.0-2 RSQLite_0.9-1 > [11] splines_2.11.1 survival_2.35-8 xtable_1.5-6 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
Michael, The IRanges package contains a number of built-in running window functions (runsum, runmean, runwtsum, runq) that you might want to consider for this operation. Also, if I am understanding what you are trying to do correctly, something like width <- 30 halfWidth <- width/2 halfWidthSums <- runsum(rle, halfWidth) diff(halfWidthSums, halfWidth)/width should fit the bill. Cheers, Patrick Quoting Martin Morgan <mtmorgan at="" fhcrc.org="">: > On 09/24/2010 03:11 AM, Michael Dondrup wrote: >> Hi, >> >> I need some help with speeding up a sliding window analysis on an >> Rle object of length > 1 million. >> I am using functions 'successiveViews' with negative gap width and >> 'viewApply' vs. viewMeans. >> My goal is to apply a discrete differential operator that computes >> the difference between the 'left' >> half of the window and the 'right', aka. a cheap discrete numeric >> first order differentiation. >> >> What I found is: viewMeans(x) << viewApply(x, mean) << viewApply(x, >> diff.op) in terms of time, example below. >> Is there a way to pimp this code to make it work on the genome >> scale? I appreciate your input, I am confident there >> is a better way to do it. > > maybe > > win <- 30 > diff(cumsum(rle), win) / win > > for numeric (not integer) rle, though there might be rounding problems > if cumsum gets large. A strategy might be to break the Rle into regions > separated by islands of at least 'win' 0's (using runLength / runValue > to identify candidate break points), which allows one to reset the > cumsum. Some inspiration might come from > http://www.mail-archive.com/r-help at r-project.org/msg75280.html > > Also the end points might need fiddling (e.g., by padding rle with 'win' > trailing zeros, which is I think in effect what successiveViews does. > > Martin > >> >> Thank you very much >> Michael >> >> Code example: >> >> diff.op <- function(x, lrprop=1/2) { >> len = length(x) >> i = ceiling(len*lrprop) >> (sum(x[i:len]) - sum(x[1:i])) / len >> } >> >> sliding.window.apply <- function(object, width, fun, ...) { >> x <- trim(successiveViews(subject=object, width=rep(width, >> ceiling(length(object)) ), gap=-width+1)) >> return (Rle(viewApply(x, fun, ...))) >> } >> >> sliding.window.mean <- function(object, width) { >> x <- trim(successiveViews(subject=object, width=rep(width, >> ceiling(length(object)) ), gap=-width+1)) >> return (viewMeans(x)) >> } >> >> rle <- Rle(1:10000) >> >>> system.time(sliding.window.mean(rle, 30)) >> user system elapsed >> 0.036 0.004 0.098 >>> system.time(sliding.window.apply(rle, 30, mean)) >> user system elapsed >> 4.380 0.065 6.010 >>> system.time(sliding.window.apply(rle, 30, diff.op)) >> user system elapsed >> 38.857 0.204 39.127 >> >>> sessionInfo() >> R version 2.11.1 (2010-05-31) >> x86_64-apple-darwin9.8.0 >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] IRanges_1.6.6 >> >> loaded via a namespace (and not attached): >> [1] annotate_1.26.1 AnnotationDbi_1.10.2 Biobase_2.8.0 >> DBI_0.2-5 DESeq_1.0.4 >> [6] genefilter_1.30.0 geneplotter_1.26.0 grid_2.11.1 >> RColorBrewer_1.0-2 RSQLite_0.9-1 >> [11] splines_2.11.1 survival_2.35-8 xtable_1.5-6 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY
0
Entering edit mode
Hi Patrick, just wanted to say thank you, that was exactly what I needed and it's much faster and more robust than my initial solution. It can be used f.e. to detect locations of strong changes in a numeric Rle vector, eg. a coverage, I didn't know that 'diff' could be used on Rle's like this. Best Michael Am Sep 24, 2010 um 5:51 PM schrieb Patrick Aboyoun: > Michael, > The IRanges package contains a number of built-in running window functions (runsum, runmean, runwtsum, runq) that you might want to consider for this operation. Also, if I am understanding what you are trying to do correctly, something like > > width <- 30 > halfWidth <- width/2 > halfWidthSums <- runsum(rle, halfWidth) > diff(halfWidthSums, halfWidth)/width > > should fit the bill. > > > Cheers, > Patrick > > > Quoting Martin Morgan <mtmorgan at="" fhcrc.org="">: > >> On 09/24/2010 03:11 AM, Michael Dondrup wrote: >>> Hi, >>> >>> I need some help with speeding up a sliding window analysis on an Rle object of length > 1 million. >>> I am using functions 'successiveViews' with negative gap width and 'viewApply' vs. viewMeans. >>> My goal is to apply a discrete differential operator that computes the difference between the 'left' >>> half of the window and the 'right', aka. a cheap discrete numeric first order differentiation. >>> >>> What I found is: viewMeans(x) << viewApply(x, mean) << viewApply(x, diff.op) in terms of time, example below. >>> Is there a way to pimp this code to make it work on the genome scale? I appreciate your input, I am confident there >>> is a better way to do it. >> >> maybe >> >> win <- 30 >> diff(cumsum(rle), win) / win >> >> for numeric (not integer) rle, though there might be rounding problems >> if cumsum gets large. A strategy might be to break the Rle into regions >> separated by islands of at least 'win' 0's (using runLength / runValue >> to identify candidate break points), which allows one to reset the >> cumsum. Some inspiration might come from >> http://www.mail-archive.com/r-help at r-project.org/msg75280.html >> >> Also the end points might need fiddling (e.g., by padding rle with 'win' >> trailing zeros, which is I think in effect what successiveViews does. >> >> Martin >> >>> >>> Thank you very much >>> Michael >>> >>> Code example: >>> >>> diff.op <- function(x, lrprop=1/2) { >>> len = length(x) >>> i = ceiling(len*lrprop) >>> (sum(x[i:len]) - sum(x[1:i])) / len >>> } >>> >>> sliding.window.apply <- function(object, width, fun, ...) { >>> x <- trim(successiveViews(subject=object, width=rep(width, ceiling(length(object)) ), gap=-width+1)) >>> return (Rle(viewApply(x, fun, ...))) >>> } >>> >>> sliding.window.mean <- function(object, width) { >>> x <- trim(successiveViews(subject=object, width=rep(width, ceiling(length(object)) ), gap=-width+1)) >>> return (viewMeans(x)) >>> } >>> >>> rle <- Rle(1:10000) >>> >>>> system.time(sliding.window.mean(rle, 30)) >>> user system elapsed >>> 0.036 0.004 0.098 >>>> system.time(sliding.window.apply(rle, 30, mean)) >>> user system elapsed >>> 4.380 0.065 6.010 >>>> system.time(sliding.window.apply(rle, 30, diff.op)) >>> user system elapsed >>> 38.857 0.204 39.127 >>> >>>> sessionInfo() >>> R version 2.11.1 (2010-05-31) >>> x86_64-apple-darwin9.8.0 >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] IRanges_1.6.6 >>> >>> loaded via a namespace (and not attached): >>> [1] annotate_1.26.1 AnnotationDbi_1.10.2 Biobase_2.8.0 DBI_0.2-5 DESeq_1.0.4 >>> [6] genefilter_1.30.0 geneplotter_1.26.0 grid_2.11.1 RColorBrewer_1.0-2 RSQLite_0.9-1 >>> [11] splines_2.11.1 survival_2.35-8 xtable_1.5-6 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > >
ADD REPLY

Login before adding your answer.

Traffic: 955 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6