Hello Everyone,
The R language currently provides save.image() for saving all objects in a workspace. But what if you are in the middle of a long-running computation in R, and you're worried about the computer crashing? Wouldn't it be nice if that computation could restart from the point at which it failed, and continue to completion?
Our group has developed DMTCP (Distributed MultiThreaded Checkpointing) for more than a decade. It is widely used, and is currently at version 2.4.3. It allows checkpoint-restart of Linux processes (such as an R session) while the computation is still running.
DMTCP information is here:
Building DMTCP is as easy as untar/configure/make. Below is a simple example of how to run R through the DMTCP wrapper:
$ dmtcp_launch --interval 300 R
# This starts an R session, in which one proceeds with the computation.
# Every 300 seconds (5 minutes), DMTCP will save:
# 1) A checkpoint image file and
# 2) A dmtcp_restart_script.sh in the current directory.
*** CRASH! *** (Let's assume the computer crashes, and one then reboots.)
# To restart the computation at the last checkpoint, R is launched as follows:
$ ./dmtcp_restart_script.sh
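The launch-or-restart choice above can be wrapped in a small helper. The following is only a sketch: the function name dmtcp_next_command is hypothetical, and it merely prints the command that would be run, based on whether a restart script from an earlier checkpoint is present in the given directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of DMTCP): print the command to run
# from directory $1 (default: the current directory).
# If a restart script from an earlier checkpoint exists, resume from it;
# otherwise start a fresh R session under DMTCP, as in the example above.
dmtcp_next_command() {
    dir=${1:-.}
    if [ -e "$dir/dmtcp_restart_script.sh" ]; then
        echo "./dmtcp_restart_script.sh"
    else
        echo "dmtcp_launch --interval 300 R"
    fi
}

dmtcp_next_command .
```

The printed command is meant to be run from within that same directory, so that the checkpoint files and the restart script stay together.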
As the Bioconductor community is one of the largest and most diverse groups of R users, we would like to get an idea of whether people would find these features helpful. We would be more than glad to help the R/Bioconductor community in creating a package that implements these concepts. We would also be happy to answer any questions you might have. If you would like more details on DMTCP, feel free to look through the questions/answers in the DMTCP FAQ ( http://dmtcp.sourceforge.net/FAQ.html ) or you can just ask your questions here.
We also have a DMTCP forum, as well as other venues to provide a friendly way to get further help from the DMTCP team:
http://dmtcp.sourceforge.net/contactUs.html
We look forward to your comments.
Best wishes,
- Gene Cooperman
Hi Martin.
The BiocParallel package is excellent for running many short programs. (In your example, the short program is the function "sqrt".)
DMTCP fills a somewhat different need. It specializes in running a single long program.
As an example, suppose we have a program in R, primes(), instead of sqrt(). Suppose the following R code is placed in a file, primes.R. (I'm not an expert in R. My apologies if my example is not a clean one.)
===========================================
maxPrime <- 1e6
system("rm -f primes.dat")
primes <- integer(1e6)
primes[1] <- 2
lastPrime <- 1  # number of primes found so far

# Trial division of n by the first round(sqrt(lastPrime)) primes found so far.
isPrime <- function(n) {
  for (i in 1:round(sqrt(lastPrime),0)) {
    if (n %% primes[i] == 0)
      return(FALSE)
  }
  return(TRUE)
}

# Append each newly found prime to primes.dat.
for (i in 2:maxPrime) {
  if (isPrime(i)) {
    lastPrime <- lastPrime + 1
    primes[lastPrime] <- i
    write(i, file="primes.dat", append=TRUE)
  }
}
===========================================
Next, build a local copy of DMTCP, to demonstrate how this works.
Then copy the file primes.R into a new directory. Within that directory, launch primes.R with Rscript under dmtcp_launch, adding the option "-i 30", which means to checkpoint every 30 seconds.
From a different window, you can do "ls -l" and watch the file primes.dat grow.
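For concreteness, the launch step probably looks like the sketch below. The exact command line is an assumption, pieced together from the "-i 30" option, the dmtcp_launch example earlier in this thread, and the fact that the job shows up as an Rscript process. The function name launch_primes and the DRY_RUN variable are hypothetical; DRY_RUN is only there so the command can be inspected on a machine without DMTCP installed.

```shell
#!/bin/sh
# Assumed launch command for the primes.R example. The command line is
# reconstructed from context, not quoted from the original post.
# DRY_RUN is a hypothetical switch: when set to 1, print instead of run.
launch_primes() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "dmtcp_launch -i 30 Rscript primes.R"
    else
        dmtcp_launch -i 30 Rscript primes.R
    fi
}

# Show the command without running it.
DRY_RUN=1 launch_primes
```

Calling launch_primes with DRY_RUN unset or set to 0 would start the computation for real, from the directory that holds primes.R.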
Whenever you like, kill the Rscript process. In that directory, you will see a file called "dmtcp_restart_script.sh". From within the same directory, now run that script.
You will see the file "primes.dat" continue to grow with more primes. DMTCP will automatically remember the size of the file "primes.dat" at the time of checkpoint. It will automatically truncate that file back to the pre-checkpoint size, and then resume the computation, appending to that file. As before, it will checkpoint every 30 seconds. If you kill it, you can resume the computation automatically from the last checkpoint.
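The effect of truncating the output file back to its checkpoint-time size can be imitated with ordinary shell tools. The sketch below is an illustration of the idea only, not DMTCP's implementation; the function name demo_truncate_on_restart and the sample values are made up for the demonstration, and it assumes the GNU coreutils "truncate" command is available.

```shell
#!/bin/sh
# Illustration only: mimics the file handling described above with
# ordinary shell tools. This is not how DMTCP itself is implemented.
demo_truncate_on_restart() {
    f=$(mktemp)
    printf '3\n5\n7\n' > "$f"          # output written before the checkpoint
    ckpt_size=$(($(wc -c < "$f")))    # file size recorded at checkpoint time
    printf '11\n13\n' >> "$f"         # output written after the checkpoint, then a "crash"
    truncate -s "$ckpt_size" "$f"     # restart: cut the file back to the recorded size
    printf '11\n13\n17\n' >> "$f"     # resumed computation appends again from the checkpoint
    cat "$f"
    rm -f "$f"
}

demo_truncate_on_restart
```

Without the truncate step, the lines written between the checkpoint and the crash (11 and 13 here) would appear twice in the file after the restart.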
Hopefully, this example shows where DMTCP could be helpful. It is not intended to replace "BiocParallel". Instead, it can be helpful in a complementary fashion.
"BiocParallel" already has a "BPREDO" mode to compute those cases that failed. So, when a computation can be split into a parallel loop, "BiocParallel" would be a better choice, since it was designed for ease of use for such loops. But sometimes, one has either a single, long program, or else a long parallel program that can't be easily split into a parallel loop. DMTCP will correctly handle both sequential and parallel programs.
Also, if the computer crashes in the middle of a BiocParallel computation, I'm not sure how easy it is to restart the BiocParallel computation. With DMTCP, one simply invokes "dmtcp_restart_script.sh".
Best,
- Gene