RFC: Checkpoint-Restart for R/BioConductor
gene • 0
@gene-9557
USA/Boston/Northeastern University

Hello Everyone,

R can already save all objects in a workspace through save.image().  But what if you are in the middle of a long-running computation in R and you are worried about the computer crashing?  Wouldn't it be nice if the computation could restart from the point where it failed and then run to completion?

Our group has been developing DMTCP (Distributed MultiThreaded Checkpointing) for more than a decade.  It is widely used and is currently at version 2.4.3.  It allows checkpoint-restart of Linux processes (such as an R session) while the computation is still running.

 DMTCP information is here:

    http://dmtcp.sourceforge.net

Building DMTCP is as easy as untar/configure/make.  Below is a simple example of how to run R through the DMTCP wrapper:

   $ dmtcp_launch --interval 300 R
      # This starts an R session in which you proceed with the computation.
      # Every 300 seconds (5 minutes), DMTCP saves:
      #    1) A checkpoint image file, and
      #    2) A dmtcp_restart_script.sh in the current directory.
   *** CRASH! *** (Let's assume the computer crashes and is then rebooted.)

   # To restart the computation from the last checkpoint:
   $ ./dmtcp_restart_script.sh
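
The two commands above could be wrapped in a small helper that resumes when a restart script is present and otherwise starts a fresh run.  This is only a minimal sketch: the function name run_or_resume and the DRY_RUN variable are made up for illustration and are not part of DMTCP; only dmtcp_launch, its --interval option, and dmtcp_restart_script.sh (with -i) come from DMTCP itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of DMTCP).
# Usage: run_or_resume INTERVAL_SECONDS COMMAND [ARGS...]
# If ./dmtcp_restart_script.sh exists, resume from the last checkpoint;
# otherwise launch COMMAND under dmtcp_launch.
# Set DRY_RUN=1 to print the chosen command instead of running it.
run_or_resume() {
    interval="$1"
    shift
    if [ -x ./dmtcp_restart_script.sh ]; then
        # A checkpoint was taken earlier in this directory: resume it.
        set -- ./dmtcp_restart_script.sh -i "$interval"
    else
        # No checkpoint yet: start the program under DMTCP.
        set -- dmtcp_launch --interval "$interval" "$@"
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

For example, "run_or_resume 300 R" would start R under DMTCP the first time and, after a crash and reboot, resume the saved session when run again from the same directory.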

As the BioConductor community is one of the largest and most diverse groups of R users, we would like to find out whether people would find these features helpful.  We would be more than glad to help the R/BioConductor community create a package that implements these concepts.  We would also be happy to answer any questions you might have.  If you would like more details on DMTCP, feel free to look through the DMTCP FAQ ( http://dmtcp.sourceforge.net/FAQ.html ), or just ask your questions here.

We also have a DMTCP forum, as well as other venues to provide a friendly way to get further help from the DMTCP team:
    http://dmtcp.sourceforge.net/contactUs.html

We look forward to your comments.

Best wishes,
- Gene Cooperman

dmtcp • 1.3k views
@martin-morgan-1513
United States

The BiocParallel package represents one way of parallel processing. In the 'devel' version, a 'typical' error would be handled and recovered with

> x <- list(1, "two", 3)
> res <- bptry(bplapply(x, sqrt))    # catch rather than signal errors
> res
[[1]]
[1] 1

[[2]]
<remote_error in FUN(...): non-numeric argument to mathematical function>
traceback() available as 'attr(x, "traceback")'

[[3]]
[1] 1.732051

> x <- list(1, 2, 3)                # correct input / recover from error
> bplapply(x, sqrt, BPREDO=res)     # redo just the error calculation
resuming previous calculation ...
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051

So I was wondering how this would work with checkpointing?  It seems like one could return, as part of the error condition, information about where the checkpoint is available, and then use that during error recovery with BPREDO.


Hi Martin. 

The BiocParallel package is excellent for running many short programs.  (In your example, the short program is the function "sqrt".)

DMTCP fills a somewhat different need.  It specializes in running a single long program.

As an example, suppose we have a program in R, primes(), instead of sqrt().  Suppose the following R code is placed in a file, primes.R .  (I'm not an expert in R.  My apologies if my example is not a clean one.)

===========================================

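maxPrime <- 1e6

# Start from a clean output file.
unlink("primes.dat")
primes <- integer(1e6)   # preallocated table of the primes found so far
primes[1] <- 2
lastPrime <- 1           # number of primes currently stored in 'primes'
# 2 is seeded by hand above, so record it in the output file as well.
write(2, file="primes.dat", append=TRUE)

# Trial division of n by each stored prime p with p*p <= n.
isPrime <- function(n) {
  for (i in seq_len(lastPrime)) {
    p <- primes[i]
    if (p * p > n) break
    if (n %% p == 0) return(FALSE)
  }
  return(TRUE)
}

for (i in 3:maxPrime) {
  if (isPrime(i)) {
    lastPrime <- lastPrime + 1
    primes[lastPrime] <- i
    write(i, file="primes.dat", append=TRUE)
  }
}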

===========================================

Next, build a local copy of DMTCP, to demonstrate how this works.

> tar zxvf dmtcp.tar.gz

> cd dmtcp && ./configure && make -j

Then create a new directory, copy primes.R into it, and run the following from within that directory.  (Note that "-i 30" means to checkpoint every 30 seconds.)

> path_to_DMTCP_root_dir/bin/dmtcp_launch -i 30 Rscript primes.R

From a different window, you can do:  "ls -l" and watch the file primes.dat grow.

Whenever you like, kill the Rscript process.  In that directory, you will see a file called "dmtcp_restart_script.sh".  From within the same directory, now do:

> ./dmtcp_restart_script.sh -i 30

You will see the file "primes.dat" continue to grow with more primes.  DMTCP will automatically remember the size of the file "primes.dat" at the time of checkpoint.  It will automatically truncate that file back to the pre-checkpoint size, and then resume the computation, appending to that file.  As before, it will checkpoint every 30 seconds.  If you kill it, you can resume the computation automatically from the last checkpoint.
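
To see why the truncation matters, here is a small stand-alone illustration in plain shell.  It only mimics the idea (it is not how DMTCP implements it), and the file contents are made up: anything appended after the checkpoint is discarded on restart, so the resumed program does not write those values a second time.

```shell
#!/bin/sh
# Illustration only: plain shell, not DMTCP's actual implementation.
out=$(mktemp)

# Before the checkpoint: two results have been written.
printf '3\n5\n' > "$out"
# Remember the file size (in bytes) at checkpoint time.
size_at_checkpoint=$(( $(wc -c < "$out") ))

# After the checkpoint, the program keeps appending until it is killed.
printf '7\n11\n' >> "$out"

# On restart: roll the file back to its checkpoint size ...
truncate -s "$size_at_checkpoint" "$out"
# ... and the resumed program recomputes and appends from where the
# checkpoint left off.
printf '7\n' >> "$out"

cat "$out"
```

The final file holds 3, 5, 7 with no duplicates.  Without the roll-back, the resumed program would append 7 after the stale 7 and 11 left over from the killed run.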

Hopefully, this example shows where DMTCP could be helpful.  It is not intended to replace "BiocParallel".  Instead, it can be helpful in a complementary fashion.

"BiocParallel" already has a "BPREDO" mode to compute those cases that failed.  So, when a computation can be split into a parallel loop, "BiocParallel" would be a better choice, since it was designed for ease of use for such loops.  But sometimes, one has either a single, long program, or else a long parallel program that can't be easily split into a parallel loop.  DMTCP will correctly handle both sequential and parallel programs.

Also, if the computer crashes in the middle of a BiocParallel computation, I'm not sure how easy it is to restart the BiocParallel computation.  With DMTCP, one simply invokes "dmtcp_restart_script.sh".

Best,

- Gene
