IRanges: Request for a "step" argument in runsum


Arnaud Amzallag ▴ 100

@arnaud-amzallag-4471

Last seen 7.1 years ago

Dear IRanges developers,

runsum is a very fast and convenient function for computing on Rle coverages, for instance. However, when it is run on several chromosomes and several samples, it can get very memory intensive. On human chromosome 1, for instance, it outputs a vector of length 250 million, so several full genomes quickly add up to billions of numbers in memory.

Often, however, you don't need single-base resolution. I would like to suggest, if possible, adding a parameter that lets the sliding window move by a user-defined step, rather than always by "step=1" as it does now. That way runsum(myRle, k=1e4, step = 1000) would return the equivalent of a wig file, for each 10 kilobases of the genome, without saturating the memory of the server.

I tried sum(Views(myRle, ir)); it is less memory intensive but much slower. So that improvement would give the best of both worlds: fast and memory efficient.

Kind regards,

Arnaud Amzallag
Research Fellow
Mass General Cancer Center / Harvard Medical School

Cancer IRanges genomes • 1.1k views

ADD COMMENT • link updated 13.0 years ago by Michael Lawrence ★ 11k • written 13.0 years ago by Arnaud Amzallag ▴ 100


Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

On Fri, May 6, 2011 at 2:54 PM, Arnaud Amzallag wrote:

> For instance on human chromosome 1, it outputs a vector of length 250 millions, so for several full genomes it is quickly billions of numbers in memory.

I would have expected the result to be an Rle, which would be fairly memory efficient.

> I tried with sum(Views(myRle, ir)), it is less memory intensive but it is much slower. So that amelioration would give the best of both worlds, fast and memory efficient.

Have you tried viewSums(Views(myRle, ir))?

ADD COMMENT • link 13.0 years ago Michael Lawrence ★ 11k
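The suggestion above can be sketched as follows. This is only an illustration, not code from the thread: the toy coverage and the small values of `k` and `step` are assumptions chosen so the result can be checked by hand (the thread uses k=1e4 and step=1000), and `ir` is built here as windows of width `k` whose starts are `step` apart.

```r
library(IRanges)

# Toy coverage of 20 positions, given as values followed by run lengths.
cov <- Rle(c(0L, 2L, 1L, 3L), c(5L, 5L, 6L, 4L))

k    <- 10L  # window width (illustrative; the thread uses 1e4)
step <- 5L   # distance between window starts (the thread uses 1000)

# Window starts spaced `step` apart, keeping only windows that fit entirely.
starts  <- seq(1L, length(cov) - k + 1L, by = step)
windows <- IRanges(start = starts, width = k)

# One sum per window, without materialising the per-base running sum.
stepped <- viewSums(Views(cov, windows))

# Cross-check against the full-resolution running sum, thinned to the same starts.
full <- as.vector(runsum(cov, k))
stopifnot(all(stepped == full[starts]))
```

Only `length(starts)` numbers are kept, so the memory cost scales with the number of windows rather than with chromosome length.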


Thank you Michael, the function viewSums was exactly what I needed!

0.014 seconds for viewSums(Views(myrle, ir)) vs 54 seconds for sum(Views(myrle, ir)) on chr22, one sample. I now use this instead of runsum: no memory problem, and it is probably even faster. For the full genome on many samples that will surely help. Maybe I should have read a bit more about Views.

About the result of runsum: I did see a lot of memory usage when I split the process with mclapply. The result is indeed an Rle. After looking closer, the resulting Rle has many more runs than the original one. That makes sense, because runsum is a kind of smoothing function, and the resulting signal has many more levels than the original one.

Kind regards,

Arnaud

ADD REPLY • link 13.0 years ago Arnaud Amzallag ▴ 100
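The observation that runsum yields an Rle with many more runs can be reproduced on a toy example. The coverage below is an assumption made up for illustration: a single block of depth 4 flanked by zeros.

```r
library(IRanges)

# Three runs: 10 zeros, 10 fours, 10 zeros.
cov <- Rle(c(0L, 4L, 0L), c(10L, 10L, 10L))

# Running sum over a 5-base window (default endrule drops incomplete windows).
smoothed <- runsum(cov, k = 5L)

nrun(cov)       # 3 runs in the input
nrun(smoothed)  # 11 runs: each block edge becomes a ramp 0, 4, 8, 12, 16, 20
```

Each sharp edge in the input turns into a ramp of distinct values, so the run-length encoding compresses the smoothed signal less well than the original.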


Good to hear that helped. One might expect sum() to simply call viewSums(), but the semantics are a bit strange here. The reason sum() works on Views is that a Views is a Ranges and thus an IntegerList (where each range encodes a sequence of integers). The weird thing is that the elements of a Views are not the sequence of integers covered but rather the values in the Rle. That everything works as you expected is just a coincidence of dispatch.

For usability we should probably have max(), min(), and sum() just use viewMaxs, viewMins and viewSums. It's more intuitive to think of an RleViews like an RleList rather than an IntegerList.

ADD REPLY • link 13.0 years ago Michael Lawrence ★ 11k
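A small sketch of the explicit per-view summaries named above. The data are made up for illustration, and the example calls viewMins, viewMaxs and viewSums directly rather than min(), max() and sum(), since how those generics dispatch on Views may differ between IRanges versions.

```r
library(IRanges)

# 12 positions: four 1s, two 5s, six 2s.
cov <- Rle(c(1L, 5L, 2L), c(4L, 2L, 6L))

# Three adjacent windows of width 4: [1,4], [5,8], [9,12].
windows <- IRanges(start = c(1L, 5L, 9L), width = 4L)
v <- Views(cov, windows)

mins <- viewMins(v)  # 1, 2, 2
maxs <- viewMaxs(v)  # 1, 5, 2
sums <- viewSums(v)  # 4, 14, 8
```

Each function returns one value per view, which matches the "RleViews as an RleList" way of thinking described in the reply.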


Thank you Michael. Because of an email filter, I only saw your reply now. I reckon that in my case things would have gone much more smoothly if sum() called viewSums() by default, and I agree that "It's more intuitive to think of an RleViews like an RleList rather than an IntegerList." I would support that change.

It is also possible that earlier I was summing the values of the Rle and did not notice the difference because my Rle was made of a lot of very short runs.

Arnaud

ADD REPLY • link 13.0 years ago Arnaud Amzallag ▴ 100
