IRanges: Request for a "step" argument in runsum
1
0
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States
Hi Michael, Arnaud, On 11-05-25 08:21 AM, Arnaud Amzallag wrote: > Thank you Michael, > > for some email filters reasons I saw your reply only now. > > I recon that in my case that would have been much smoother if sum() would > call viewSums() by default and I agree that "It's more intuitive to think of > an RleViews like an RleList rather than an IntegerList.". I would support > that change. I've recently made some changes in that direction in IRanges devel. This subtle shift in the nature of an RleViews (from IntegerList to RleList) has some deep consequences on the hierarchy of classes in IRanges. The most important one is that the Views class doesn't extend the IRanges class anymore. Views is now a direct subclass of List. This means that you cannot directly manipulate a Views object as if it was an IRanges object but you must extract its ranges (with ranges()) first. This is a work-in-progress and a few things still need to be polished. Also I agree that we should just have "min", "max", "sum", "mean" etc methods for Views objects. No need for viewMins, viewMaxs, viewSums etc I'll change this. Cheers, H. > > Also it is possible that before I was summing the values of the Rle and did > not notice the difference because my Rle was made of a lot of very short Rle > lengths. > > Arnaud > > On Tue, May 10, 2011 at 8:44 AM, Michael Lawrence<lawrence.michael at="" gene.com="">> wrote: > >> Good to hear that helped. One might expect sum() to simply call viewSums(), >> but the semantics are a bit strange here. The reason sum() works on Views is >> that a Views is a Ranges and thus an IntegerList (where each range encodes a >> sequence of integers). The weird thing is that the elements of a Views are >> not the sequence of integers covered but rather the values in the Rle. That >> everything works as you expected is just a coincidence of dispatch. >> >> For usability we should probably have max(), min(), and sum() just use >> viewMaxs, viewMins and viewSums. It's more intuitive to think of an RleViews >> like an RleList rather than an IntegerList. >> >> >> On Sun, May 8, 2011 at 1:44 PM, Arnaud Amzallag<arnaud.amzallag at="" gmail.com="">>> wrote: >> >>> Thank you Michael, the function viewSums was exactly what I needed ! >>> >>> 0.014 seconds for viewSums(Views(myrle, ir)) vs 54 seconds for >>> sum(Views(myrle, ir)) on chr22, one sample. I use this now instead of of >>> runsum, no problem of memory, and probably even faster. for full the genome >>> on many samples that will surely help. Maybe I should have read a bit more >>> about the Views. >>> >>> About the result of runsum, I did see a lot of memory usage when I split >>> the process with mclapply. The result is indeed a Rle. After looking closer, >>> the resuting Rle has much more runs that the original one. That makes sense, >>> because runsum is a kind of smoothing function, and the resulting signal has >>> much more levels than the original one. >>> >>> Kind regards, >>> >>> Arnaud >>> >>> On May 6, 2011, at 10:42 PM, Michael Lawrence wrote: >>> >>> >>> >>> On Fri, May 6, 2011 at 2:54 PM, Arnaud Amzallag< >>> arnaud.amzallag at gmail.com> wrote: >>> >>>> Dear IRanges developers, >>>> >>>> runsum is a very fast and convenient function to compute on Rle >>>> coverages, for instance. However when it is run on several chromosomes and >>>> several samples, it can get very memory intensive. For instance on human >>>> chromosome 1, it outputs a vector of length 250 millions, so for several >>>> full genomes it is quickly billions of numbers in memory. >>>> >>>> >>> I would have expected the result to be an Rle, which would be fairly >>> memory efficient. >>> >>> >>>> However, often you don't need a single base resolution. I wanted to >>>> suggest, if it is possible, to add a parameter by which one could have the >>>> sliding window to slide by a user defined step, rather than always "step=1", >>>> as it is now. Such that runsum(myRle, k=1e4, step = 1000) would return the >>>> equivalent of a wig file, for each 10 kilobases of the genome, without >>>> saturating the memory of the server. >>>> >>>> I tried with sum(Views(myRle, ir)), it is less memory intensive but it is >>>> much slower. So that amelioration would give the best of both worlds, fast >>>> and memory efficient. >>>> >>>> >>> Have you tried viewSums(Views(myRle, ir))? >>> >>> >>>> kind regards, >>>> >>>> Arnaud Amzallag >>>> Research Fellow >>>> Mass general Cancer Center / Harvard Medical school >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >>> >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
Cancer PROcess IRanges genomes Cancer PROcess IRanges genomes • 735 views
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 2.4 years ago
United States
2011/6/28 Hervé Pagès <hpages@fhcrc.org> > Hi Michael, Arnaud, > > > On 11-05-25 08:21 AM, Arnaud Amzallag wrote: > >> Thank you Michael, >> >> for some email filters reasons I saw your reply only now. >> >> I recon that in my case that would have been much smoother if sum() would >> call viewSums() by default and I agree that "It's more intuitive to think >> of >> an RleViews like an RleList rather than an IntegerList.". I would support >> that change. >> > > I've recently made some changes in that direction in IRanges devel. > This subtle shift in the nature of an RleViews (from IntegerList > to RleList) has some deep consequences on the hierarchy of classes > in IRanges. The most important one is that the Views class doesn't > extend the IRanges class anymore. Views is now a direct subclass > of List. This means that you cannot directly manipulate a Views object > as if it was an IRanges object but you must extract its ranges (with > ranges()) first. > > This is pretty unfortunate. We lose the ability to e.g. store the coverage inside RangedData (because a ViewsList is a RangesList). Is it really necessary for what was proposed? > This is a work-in-progress and a few things still need to be polished. > > Also I agree that we should just have "min", "max", "sum", "mean" etc > methods for Views objects. No need for viewMins, viewMaxs, viewSums etc > I'll change this. > > Cheers, > H. > > >> Also it is possible that before I was summing the values of the Rle and >> did >> not notice the difference because my Rle was made of a lot of very short >> Rle >> lengths. >> >> Arnaud >> >> On Tue, May 10, 2011 at 8:44 AM, Michael Lawrence<lawrence.michael@**>> gene.com <lawrence.michael@gene.com> >> >>> wrote: >>> >> >> Good to hear that helped. One might expect sum() to simply call >>> viewSums(), >>> but the semantics are a bit strange here. The reason sum() works on Views >>> is >>> that a Views is a Ranges and thus an IntegerList (where each range >>> encodes a >>> sequence of integers). The weird thing is that the elements of a Views >>> are >>> not the sequence of integers covered but rather the values in the Rle. >>> That >>> everything works as you expected is just a coincidence of dispatch. >>> >>> For usability we should probably have max(), min(), and sum() just use >>> viewMaxs, viewMins and viewSums. It's more intuitive to think of an >>> RleViews >>> like an RleList rather than an IntegerList. >>> >>> >>> On Sun, May 8, 2011 at 1:44 PM, Arnaud Amzallag<arnaud.amzallag@**>>> gmail.com <arnaud.amzallag@gmail.com> >>> >>>> wrote: >>>> >>> >>> Thank you Michael, the function viewSums was exactly what I needed ! >>>> >>>> 0.014 seconds for viewSums(Views(myrle, ir)) vs 54 seconds for >>>> sum(Views(myrle, ir)) on chr22, one sample. I use this now instead of of >>>> runsum, no problem of memory, and probably even faster. for full the >>>> genome >>>> on many samples that will surely help. Maybe I should have read a bit >>>> more >>>> about the Views. >>>> >>>> About the result of runsum, I did see a lot of memory usage when I split >>>> the process with mclapply. The result is indeed a Rle. After looking >>>> closer, >>>> the resuting Rle has much more runs that the original one. That makes >>>> sense, >>>> because runsum is a kind of smoothing function, and the resulting signal >>>> has >>>> much more levels than the original one. >>>> >>>> Kind regards, >>>> >>>> Arnaud >>>> >>>> On May 6, 2011, at 10:42 PM, Michael Lawrence wrote: >>>> >>>> >>>> >>>> On Fri, May 6, 2011 at 2:54 PM, Arnaud Amzallag< >>>> arnaud.amzallag@gmail.com> wrote: >>>> >>>> Dear IRanges developers, >>>>> >>>>> runsum is a very fast and convenient function to compute on Rle >>>>> coverages, for instance. However when it is run on several chromosomes >>>>> and >>>>> several samples, it can get very memory intensive. For instance on >>>>> human >>>>> chromosome 1, it outputs a vector of length 250 millions, so for >>>>> several >>>>> full genomes it is quickly billions of numbers in memory. >>>>> >>>>> >>>>> I would have expected the result to be an Rle, which would be fairly >>>> memory efficient. >>>> >>>> >>>> However, often you don't need a single base resolution. I wanted to >>>>> suggest, if it is possible, to add a parameter by which one could have >>>>> the >>>>> sliding window to slide by a user defined step, rather than always >>>>> "step=1", >>>>> as it is now. Such that runsum(myRle, k=1e4, step = 1000) would return >>>>> the >>>>> equivalent of a wig file, for each 10 kilobases of the genome, without >>>>> saturating the memory of the server. >>>>> >>>>> I tried with sum(Views(myRle, ir)), it is less memory intensive but it >>>>> is >>>>> much slower. So that amelioration would give the best of both worlds, >>>>> fast >>>>> and memory efficient. >>>>> >>>>> >>>>> Have you tried viewSums(Views(myRle, ir))? >>>> >>>> >>>> kind regards, >>>>> >>>>> Arnaud Amzallag >>>>> Research Fellow >>>>> Mass general Cancer Center / Harvard Medical school >>>>> ______________________________**_________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@r-project.org >>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: sta="" t.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.**science.biology.informatics.**cond uctor<http: news.gmane.org="" gmane.science.biology.informatics.conducto="" r=""> >>>>> >>>>> >>>> >>>> >>>> >>> >> [[alternative HTML version deleted]] >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]
ADD COMMENT

Login before adding your answer.

Traffic: 547 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6