Question: pooling for parallel hierarchical operations
Michael Lawrence (10k, United States) wrote 6.3 years ago:
We often execute nested operations in parallel: for example, first by sample, then by chromosome. Fixed allocation of resources to each level will often result in waste. For example, if one sample finishes quickly, its CPUs are not available to help the other samples along.

Perhaps the most expedient solution is to expand.grid() the hierarchy and create one job for every combination, i.e., flatten the hierarchy. A more ideal solution might be a pool of resources (cores) that are allocated more fluidly. Is there any sort of pooling system for R? I know that the parallel package supports the declaration of resources in cluster objects, but there is no central manager.

This is a general R question, but it's worth discussing in the context of how we can make better use of parallelism in the low-level infrastructure, which would cause these hierarchies to arise. It's also relevant to the discussion of specifying parallelization modes or strategies. Pools themselves could be hierarchical and heterogeneous (hosts, cores). Declaring available resources is fairly straightforward. Deciding how to use them is context dependent and requires user control.

Michael
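The flattening approach mentioned in the question could look roughly like the sketch below. It is only an illustration under stated assumptions: the sample and chromosome names and the `process()` function are hypothetical stand-ins, and `mclapply()` relies on forking, so `mc.cores = 2` will not work on Windows. Setting `mc.preschedule = FALSE` makes each job start as soon as a core is free, rather than dividing the jobs among cores up front.

```r
library(parallel)

# Hypothetical inputs: three samples and four chromosomes.
samples <- c("s1", "s2", "s3")
chroms  <- c("chr1", "chr2", "chr3", "chr4")

# Hypothetical per-job function; stands in for the real computation.
process <- function(sample, chrom) {
  paste(sample, chrom, sep = ":")
}

# Flatten the two-level hierarchy: one row (job) per combination.
jobs <- expand.grid(sample = samples, chrom = chroms,
                    stringsAsFactors = FALSE)

# Run every job against the same set of cores. With
# mc.preschedule = FALSE a new job is forked whenever a core is free.
flat <- mclapply(seq_len(nrow(jobs)), function(i) {
  process(jobs$sample[i], jobs$chrom[i])
}, mc.cores = 2, mc.preschedule = FALSE)

# Restore the hierarchy: group the flat results by sample.
by_sample <- split(unlist(flat), jobs$sample)
```

Because no core is tied to a particular sample, a sample that finishes early no longer leaves cores idle. The cost is that any per-sample setup has to be repeated, or cached, in every job.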
infrastructure • 423 views
Answer: pooling for parallel hierarchical operations
Malcolm Cook (1.5k, United States) wrote 6.3 years ago:
Michael,

Have you seen http://cran.r-project.org/web/packages/doRedis/index.html?

If you take a look and come across a description of internals/architecture, please share.

Cheers,
~Malcolm
I hadn't seen that, thanks. It looks like a nice mechanism for passing messages and sharing data between multiple clients. It would be interesting if someone created an R environment based on a dynamic object-tables backend that shared data via Redis. I don't see anything about managing resource pools, though.

Michael
Answer: pooling for parallel hierarchical operations
Martin Morgan (23k, United States) wrote 6.3 years ago:
Hi Michael,

I don't really have an answer for you, but it sounds like you're looking for a scheduler, with the idea that the 'workers' each have a deque of tasks that they are responsible for, but with some kind of collaboration between workers to balance tasks. I don't think the user should have (or have to have) influence on the scheduler; it mostly just does the right thing. I think it would be good to develop scheduler(s) orthogonal to the parallel algorithm (lapply, pvec, map/reduce, etc.).

I've started a BiocParallel package in Bioconductor's svn and on GitHub:

https://github.com/Bioconductor/BiocParallel

so that might provide a place to focus this development. I'd encourage use of GitHub and its social coding as the primary means for development at this time.

Martin

--
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
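To make the scheduling argument concrete, here is a small simulation sketch comparing a fixed per-sample allocation with a shared pool. It is not the deque-based, work-sharing scheduler described above; it swaps in a simpler central-queue model in which each task goes to whichever worker becomes free first. The function name `pool_makespan()` and the task durations are hypothetical, chosen only for illustration.

```r
# Finish time when each task goes to whichever worker frees up first.
# This models a central task queue, not per-worker deques.
pool_makespan <- function(durations, n_workers) {
  free_at <- numeric(n_workers)   # time at which each worker is next idle
  for (d in durations) {
    w <- which.min(free_at)       # earliest-available worker
    free_at[w] <- free_at[w] + d
  }
  max(free_at)
}

# Hypothetical task durations (arbitrary time units) per sample.
tasks <- list(a = c(1, 1, 1, 1),  # a sample that finishes quickly
              b = c(4, 4, 4, 4))  # a slow sample

# Fixed allocation: each sample owns 2 of the 4 workers.
fixed <- max(sapply(tasks, pool_makespan, n_workers = 2))

# Shared pool: all 4 workers serve every task.
pooled <- pool_makespan(unlist(tasks), n_workers = 4)

c(fixed = fixed, pooled = pooled)
```

With these made-up durations the fixed split finishes at time 8 and the shared pool at time 5, because the two workers that complete sample `a` early go on to help with sample `b`. Note that `pool_makespan()` takes only durations and a worker count, so the scheduling policy stays separate from whatever algorithm produced the tasks.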
Thanks for setting this up. I think we might want to look into how other high-level languages have approached these issues. The user will need some high-level control. For example, only the user is going to know how much memory a job will consume. I'm sure there are heuristics and simplifying assumptions/constraints that will go a long way towards autonomy, though.

Michael
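One simple heuristic of the kind alluded to here is to let the user declare a per-job memory estimate and cap the number of concurrent workers accordingly. The sketch below is an assumption-laden illustration, not an existing API: `workers_for()` and its arguments are hypothetical names, and the memory figures would have to come from the user.

```r
# Hypothetical helper: cap the worker count by both cores and memory.
# mem_per_job_gb is a user-supplied estimate; nothing here measures it.
workers_for <- function(n_cores, total_mem_gb, mem_per_job_gb) {
  stopifnot(mem_per_job_gb > 0)
  by_memory <- floor(total_mem_gb / mem_per_job_gb)
  # Always allow at least one worker so that work can proceed.
  max(1, min(n_cores, by_memory))
}

# 16 cores and 64 GB, with jobs estimated at 10 GB each.
n_workers <- workers_for(n_cores = 16, total_mem_gb = 64,
                         mem_per_job_gb = 10)
```

In this example memory, not cores, is the limit: only six 10 GB jobs fit in 64 GB, so `n_workers` is 6. In practice `n_cores` might come from `parallel::detectCores()`.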
I've got ten bucks on Michael coming back 2-3 weeks from now with his own bioakkaR library: http://akka.io

Who's in?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact