Hello
I am working with a slightly customized Bioconductor AMI (version 3.1) on which I installed my own packages. I am trying to create a bigger cluster - 50 spot instances with 32 CPUs each (c3.8xlarge) - on Amazon AWS (region: EU Ireland) with the help of the pre-installed StarCluster and the parallel backends described in BiocAMI. The problem is that it is not working.
Three backend options are described on the help page of the Bioconductor AMI and I am having problems with all of them, most importantly the SGE backend, as I intended to use it.
All of the following problems can be reproduced by executing the minimal examples described on the help page (see hyperlink above), but using instances that have more than one CPU.
- MPI: Not working. I get "rstudio initialization error: unable to connect to service" after logging in on the master node's RStudio Server login page.
- SSH: Returns a "system2" error when calling makeSSHWorker(nodename="nameofnode"), which I traced back to the function "runOScommandlinux".
- SGE: It is working, yet apparently does not recognize the CPUs which I specify with
param <- BatchJobsParam(50, resources=list(ncpus=32))
The reason I believe this is a) the missing performance increase expected from 50*32=1600 parallel workers and b) the instance workload shown in the AWS console, where I can see that only a small part of each instance's CPU capacity is used.
Especially regarding the SGE backend, I would appreciate information or help. Have I reached a limit with this many instances and nodes? Does anyone have experience with this?
Thank you very much for any help in advance.
Kind regards,
Nikolai
Hi Nikolai,
I am looking into this. I started investigating issue 2 with the ssh clusters and have found the problem, but I am not sure yet what the solution is.
As for SGE, I am not sure that looking at performance workload in the AWS console is the best way to determine whether all cores are being used.
What if instead you use the example on the AMI page (which calls system("hostname") on each node), but replace the configuration of param with the way you are already configuring it and change 1:100 to 1:1600? If things are working correctly you should see a list of 50 nodes with 32 jobs run on each one.
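One way to run this check, as a minimal sketch: it assumes BiocParallel with the BatchJobs backend as shipped on the AMI and a running SGE cluster. The helper names count_jobs_per_host and run_hostname_check are made up for illustration.

```r
# Tabulate how many jobs ran on each host (pure helper, needs no cluster).
count_jobs_per_host <- function(hosts) {
  table(unlist(hosts))
}

# Hypothetical wrapper: submit one hostname lookup per expected worker.
# Assumes BiocParallel and BatchJobs are installed and SGE is configured.
run_hostname_check <- function(n_nodes = 50, cpus_per_node = 32) {
  param <- BiocParallel::BatchJobsParam(
    n_nodes, resources = list(ncpus = cpus_per_node)
  )
  n_jobs <- n_nodes * cpus_per_node  # 1600 in the setup discussed here
  hosts <- BiocParallel::bplapply(
    seq_len(n_jobs),
    function(i) system("hostname", intern = TRUE),
    BPPARAM = param
  )
  count_jobs_per_host(hosts)
}
```

Calling run_hostname_check() on the master node should then return a table with one entry per node that actually received work.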
This doesn't tell us precisely whether the jobs really used all cores on each node, I guess, but it at least tells us whether each node in the cluster was used. For the former you might need to know more about SGE than I do. Perhaps under SGE each worker (that is, each combination of node and CPU) has a unique ID that could be printed out? Anyone know?
OK, I have more info on the issue with ssh clusters. It has to do with the fact that the BatchJobs package is installed onto the AMI in a non-standard library directory. BatchJobs tries to ssh to each node in the cluster and call R to determine the location of BatchJobs on that node so that it can run a helper script. However, when you run a command on a remote machine with ssh (in contrast to starting an interactive session), it does not read config files (such as ~/.bashrc) that set up your environment. So R can't find BatchJobs and everything fails.
The fix is for me to generate the AMIs going forward with BatchJobs installed in the default library directory. I will do this for the BioC 3.2 AMI after 3.2 is released on October 14 and for all new AMIs after that. I won't do it for old ones. And it sounds like you have already customized the AMI for your own needs, so here is how you can work around this issue:
- Start your AMI outside of StarCluster, either with the AWS console or using the aws command line tool.
- ssh to the instance you have started (as the ubuntu user) and issue this command:
sudo R --vanilla
And then, in R:
install.packages("BatchJobs", repos="http://cran.rstudio.com/")
That will install BatchJobs in the default library location.
Then you can stop the instance and create a new AMI from it (then terminate the instance). Note the AMI ID and replace the AMI ID in your StarCluster config file with the new AMI ID.
Then you should be able to use an ssh cluster. If you run into any issues, post them here.
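Before creating the new AMI, it may be worth confirming where BatchJobs actually ended up. The following is a rough sketch; package_library is a hypothetical helper name, and which directory counts as the default depends on the R build, so compare its result with what .libPaths() reports inside sudo R --vanilla.

```r
# Return the library directory a package is installed in,
# or NA if R cannot find the package on its current library paths.
package_library <- function(pkg) {
  path <- system.file(package = pkg)
  if (nzchar(path)) dirname(path) else NA_character_
}

# On the rebuilt instance one would run, inside R --vanilla:
#   package_library("BatchJobs")
#   .libPaths()
# and check that the first result is one of the listed paths.
```

If package_library("BatchJobs") returns NA in a vanilla session, R started over ssh will most likely not find the package either.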
Hi Nikolai,
Here are the steps I followed to get a function running via BiocParallel (using SGE). Please try to mimic this and tell us if you're getting the expected output or what's failing:
Except that your final table should include 50 entries rather than 2.
Hi Brian,
thank you very much for your reply. I will definitely try out your solution the next time I am working with my StarCluster+Bioconductor setup and report any issues.
When I wrote my post I was under a bit of time pressure, so that I had to implement a dirty workaround to get it to work with the SGE backend. This workaround is described in my reply to Dan's comment.
Hi Dan,
thank you very much for your help. I will try out an SSH cluster the next time I am working with my StarCluster+Bioconductor setup and report any issues.
Regarding the SGE issue I reported above: I was in a great hurry to finish my simulations that day, as they were part of my now finished Bachelor thesis and I was already way behind schedule. So I implemented an ad hoc version in which the BatchJobs parameter connects to only one core on each of the 50 instances. On each of those 50 cores (each on a separate instance) I then start a function that uses foreach (from the foreach package), set up to work on all 32 cores of that machine. Here is some example code to illustrate it:
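A minimal sketch of such a two-level setup, written as an illustration of the idea rather than the exact code that was run. It assumes BiocParallel with BatchJobs for the outer level and foreach with doParallel for the inner level; requesting ncpus = 1 per job and the helper names split_tasks, run_chunk_on_all_cores and run_nested are assumptions made for this example.

```r
# Split tasks into n_nodes contiguous, roughly equal chunks (one per instance).
split_tasks <- function(tasks, n_nodes) {
  group <- ceiling(seq_along(tasks) * n_nodes / length(tasks))
  unname(split(tasks, group))
}

# Runs on one instance: spread a chunk over every core detected locally.
run_chunk_on_all_cores <- function(chunk, fun) {
  `%dopar%` <- foreach::`%dopar%`
  n_cores <- parallel::detectCores()
  doParallel::registerDoParallel(cores = n_cores)
  on.exit(doParallel::stopImplicitCluster())
  foreach::foreach(x = chunk) %dopar% fun(x)
}

# Outer level: one SGE job per instance, each asking for a single slot
# (assumption), so that SGE places one job on every machine.
run_nested <- function(tasks, fun, n_nodes = 50) {
  param <- BiocParallel::BatchJobsParam(n_nodes, resources = list(ncpus = 1))
  chunks <- split_tasks(tasks, n_nodes)
  res <- BiocParallel::bplapply(
    chunks, run_chunk_on_all_cores, fun = fun, BPPARAM = param
  )
  unlist(res, recursive = FALSE)  # flatten back to one result per task
}
```

With 1600 tasks and 50 nodes, split_tasks yields 50 chunks of 32 tasks, so each instance gets one task per core.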
Of course, I made sure that detectCores() actually detected all of the 32 cores on the 50 machines - which it did.
To my surprise it worked. Some hasty benchmarking showed that it was significantly faster than just using
and the CPU workload was at 100%. All this information is a bit subjective and not 100% indicative that all 1600 cores were working, but it had to work for me and it did. One definite caveat lies in the functionality of
If it does not work as intended in recognising the 32 cores on the 50 machines, I cannot guarantee that it works perfectly with my ad hoc setup. Not that I advise anyone to use it.