pooling for parallel hierarchical operations
@michael-lawrence-3846
We often execute nested operations in parallel; for example, first by sample, then by chromosome. Fixed allocation of resources to each level will often result in waste: if one sample finishes quickly, its CPUs are not available to help the other samples along. Perhaps the most expedient solution is to expand.grid() the hierarchy and create one job for every combination, i.e., flatten the hierarchy. A more ideal solution might be a pool of resources (cores) that are allocated more fluidly.

Is there any sort of pooling system for R? I know that the parallel package supports the declaration of resources in cluster objects, but there is no central manager. This is a general R question, but it's worth discussing in the context of how we can make better use of parallelism in the low-level infrastructure, which would cause these hierarchies to arise. It's also relevant to the discussion of specifying parallelization modes or strategies. Pools themselves could be hierarchical and heterogeneous (hosts, cores). Declaring available resources is fairly straightforward; deciding how to use them is context-dependent and requires user control.

Michael
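For concreteness, a minimal sketch of the flattening approach using base R's parallel package. The `process()` function is a hypothetical placeholder for the real per-(sample, chromosome) work:

```r
library(parallel)

samples <- paste0("s", 1:3)
chroms  <- paste0("chr", 1:2)

# Flatten the sample x chromosome hierarchy into one job per combination,
# so a fast-finishing sample frees its core for the remaining jobs.
jobs <- expand.grid(sample = samples, chrom = chroms,
                    stringsAsFactors = FALSE)

process <- function(sample, chrom) {
  # stand-in for the real per-(sample, chromosome) computation
  paste(sample, chrom, sep = ":")
}

# One flat pool of workers services all 6 jobs, regardless of level.
results <- mclapply(seq_len(nrow(jobs)), function(i)
  process(jobs$sample[i], jobs$chrom[i]),
  mc.cores = 2)
```

Note that mclapply forks and so runs serially (mc.cores = 1) on Windows; a snow-style cluster would be needed there.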
Infrastructure
Malcolm Cook @malcolm-cook-6293
Michael,

Have you seen http://cran.r-project.org/web/packages/doRedis/index.html?

If you take a look and come across a description of its internals/architecture, please share.

Cheers,
~Malcolm
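For readers unfamiliar with it, a rough sketch of the doRedis model (this assumes a Redis server running on localhost; the queue name "jobs" is arbitrary). Workers pull tasks from a shared queue as they become free, which is essentially the fluid pool Michael describes:

```r
library(doRedis)   # foreach backend that queues tasks in Redis
library(foreach)

registerDoRedis("jobs")                    # attach foreach to a Redis work queue
startLocalWorkers(n = 4, queue = "jobs")   # workers pull tasks as they free up

# Tasks from any level of a hierarchy can be pushed into the same queue,
# so idle cores are reused across samples automatically.
res <- foreach(i = 1:8, .combine = c) %dopar% sqrt(i)

removeQueue("jobs")
```

Workers can also be started on other hosts pointing at the same Redis instance, so the pool can be heterogeneous (hosts, cores) as Michael suggests.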
I hadn't seen that, thanks. It looks like a nice mechanism for passing messages and sharing data between multiple clients. It would be interesting if someone created an R environment based on a dynamic object-table backend that shared data via Redis. I don't see anything about managing resource pools, though.

Michael
Martin Morgan @martin-morgan-1513
Hi Michael -- I don't really have an answer for you, but it sounds like you're looking for a scheduler, with the idea that the workers have a deque of tasks they are responsible for, plus some kind of collaboration between workers to balance tasks. I don't think the user should have (or have to have) influence on the scheduler; it mostly just does the right thing. I think it would be good to develop scheduler(s) orthogonal to the parallel algorithm (lapply, pvec, map/reduce, etc.).

I've started a BiocParallel package in Bioconductor's svn and on github

https://github.com/Bioconductor/BiocParallel

so that might provide a place to focus this development; I'd encourage use of github and its social coding as the primary means for development at this time.

Martin

--
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
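For later readers: the basic BiocParallel idiom, as the package eventually took shape (a sketch; the API in the 2012 svn version may have differed). The separation of the param object (the declared resources) from the apply call mirrors the point about keeping schedulers orthogonal to the parallel algorithm:

```r
library(BiocParallel)

# Declare the resource pool once, separately from the algorithm...
param <- MulticoreParam(workers = 4)   # forked workers on Unix-alikes

# ...then reuse it with any parallel idiom (bplapply, bpmapply, ...).
res <- bplapply(1:8, function(i) i^2, BPPARAM = param)
```

On Windows, SnowParam() would replace MulticoreParam() without changing the bplapply call.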
Thanks for setting this up. I think we might want to look into how other high-level languages have approached these issues. The user will need some high-level control; for example, only the user is going to know how much memory a job will consume. I'm sure there are heuristics and simplifying assumptions/constraints that will go a long way toward autonomy, though.

Michael
I've got ten bucks on Michael coming back 2-3 weeks from now with his own bioakkaR library: http://akka.io

Who's in?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact