Question: Can bplapply time out when waiting for long jobs?
Asked 3.5 years ago by Todd Creasy:

Hi,

I'm writing some code to submit to a Torque cluster using BiocParallel. bplapply *does* submit to the cluster: I can see the jobs via qstat, and they run on the node. However, long-running jobs do not seem to "wait" and eventually throw the error listed below, as if a timeout cuts the waiting step short. While debugging, I've been running an example function that I modified to run about 10k times. Any ideas?

It appears that the timeout is hard-coded for Torque in BatchJobsParam, and I can't see a way to change it.

Code and error below:

    library("BiocParallel")  # provides BatchJobsParam(), register(), bplapply()
    library("BatchJobs")

    torque.functions <- makeClusterFunctionsTorque(
        x@templatefile,
        list.jobs.cmd=c("qstat")
    )

    bpparam <- BatchJobsParam(workers="2",
                            jobname="test",
                            resources=list(nodes="4:ppn=2", vmem="20gb"),
                            cluster.functions=torque.functions,
                            progressbar=TRUE)
    register(bpparam)

    FUN <- function(i) system("hostname", intern=TRUE)
    xx <- bplapply(1:10000, FUN)

    print("RESULTS:")
    print(table(unlist(xx)))

SubmitJobs |+++++++++++++++++++++++++++++++++++++++++++++++++| 100% (00:00:00)
Waiting [S:10000 D:0 E:0 R:0] |+                                 |   0% (00:00:00)
Error in getResults(reg, ids, part, missing.ok) :
  Some job result files do not exist, showing up to first 10:
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/43/143-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/45/145-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/52/152-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/54/154-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp/BiocParallel_tmp_12c351d5a1510/jobs/56/156-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/58/158-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/59/159-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/60/160-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/61/161-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/62/162-result.RData
Calls: main ... bplapply -> bplapply -> loadResults -> getResults -> stopf
Execution halted

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] ggplot2_2.0.0              DESeq2_1.10.1
 [3] RcppArmadillo_0.6.500.4.0  Rcpp_0.12.3
 [5] pheatmap_1.0.8             RColorBrewer_1.1-2
 [7] BiocParallel_1.4.3         GenomicAlignments_1.6.3
 [9] Rsamtools_1.22.0           Biostrings_2.38.3
[11] XVector_0.10.0             SummarizedExperiment_1.0.2
[13] biomaRt_2.26.1             GenomicFeatures_1.22.12
[15] AnnotationDbi_1.32.3       Biobase_2.30.0
[17] GenomicRanges_1.22.4       GenomeInfoDb_1.6.3
[19] IRanges_2.4.6              S4Vectors_0.8.11
[21] BiocGenerics_0.16.1        BatchJobs_1.6
[23] BBmisc_1.9

loaded via a namespace (and not attached):
 [1] genefilter_1.52.1    locfit_1.5-9.1       splines_3.2.2
 [4] lattice_0.20-33      colorspace_1.2-6     rtracklayer_1.30.2
 [7] base64enc_0.1-3      XML_3.98-1.3         survival_2.38-3
[10] foreign_0.8-66       DBI_0.3.1            lambda.r_1.1.7
[13] plyr_1.8.3           stringr_1.0.0        zlibbioc_1.16.0
[16] munsell_0.4.2        gtable_0.1.2         futile.logger_1.4.1
[19] latticeExtra_0.6-26  geneplotter_1.48.0   acepack_1.3-3.3
[22] xtable_1.8-2         scales_0.3.0         checkmate_1.7.1
[25] Hmisc_3.17-1         annotate_1.48.0      sendmailR_1.2-1
[28] gridExtra_2.0.0      brew_1.0-6           fail_1.3
[31] digest_0.6.9         stringi_1.0-1        grid_3.2.2
[34] tools_3.2.2          bitops_1.0-6         magrittr_1.5
[37] RCurl_1.95-4.7       RSQLite_1.0.0        Formula_1.2-1
[40] cluster_2.0.3        futile.options_1.0.0 rpart_4.1-10
[43] nnet_7.3-12

Hi Todd, are you sure that this is a timeout / error in BiocParallel? There is a timeout in your version of BiocParallel, but it is timeout=Inf, which is a pretty long time (in the devel version the timeout is 30 days, which is also a long time)! To debug, it might help to take BiocParallel out of the picture and use BatchJobs more or less directly, along the lines of:

    library(BiocParallel); library(BatchJobs); library(BBmisc)

    # torque.functions is the object created in the question above
    FUN <- function(i) system("hostname", intern=TRUE)
    X <- 1:100
    bpparam <- BatchJobsParam(workers="2", jobname="test",
        resources=list(nodes="4:ppn=2", vmem="20gb"), cluster.functions=torque.functions,
        progressbar=TRUE)

    # set up a BatchJobs registry by hand, with one job per element of X
    setConfig(conf = bpparam$conf.pars)
    reg <- makeRegistry(id = "test", file.dir = tempfile(), seed = 123)
    ids <- batchMap(reg, FUN, X)

    # group the jobs into one chunk per worker, submit, and wait
    cids <- BBmisc::chunk(ids, n.chunks = bpworkers(bpparam), shuffle = TRUE)
    submitJobs(reg, cids, resources=bpparam$resources)
    waitForJobs(reg, ids, stop.on.error=bpstopOnError(bpparam))

    # collect the results
    res <- loadResults(reg, ids, use.names = "none")
    table(unlist(res))
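
If loading the results fails with the same "result files do not exist" error, it may help to see exactly which job ids have no result file yet. The following is only a minimal base-R sketch, not part of BatchJobs: `result_path()` and `missing_results()` are hypothetical helpers, and the directory layout (`jobs/<last two digits of id>/<id>-result.RData`) is an assumption read off the paths in the error message above. The demonstration uses a throwaway directory rather than a real registry.

```r
# Hypothetical helper: build the expected result file path for each job id.
# ASSUMPTION: layout inferred from the error message in the question,
# i.e. <file.dir>/jobs/<id %% 100, two digits>/<id>-result.RData
result_path <- function(file.dir, ids) {
    ids <- as.integer(ids)
    file.path(file.dir, "jobs", sprintf("%02d", ids %% 100L),
              sprintf("%d-result.RData", ids))
}

# Hypothetical helper: return the ids whose result file does not exist
missing_results <- function(file.dir, ids) {
    ids[!file.exists(result_path(file.dir, ids))]
}

# Self-contained demonstration with a fake registry directory
demo.dir <- tempfile("registry_")
ids <- 141:145
done <- c(141L, 142L, 144L)          # pretend only these jobs have finished
for (p in result_path(demo.dir, done)) {
    dir.create(dirname(p), recursive = TRUE, showWarnings = FALSE)
    file.create(p)                   # empty placeholder for a result file
}

still_missing <- missing_results(demo.dir, ids)
print(still_missing)                 # [1] 143 145
```

For a real run, `file.dir` would be the registry directory (the `file.dir` given to `makeRegistry`), and re-running the check after a short pause would show whether the files are merely late in appearing on a shared file system or never written at all.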

My suspicion is that the large number of small jobs overwhelms either the file system or database, requiring more careful management from BatchJobs. Please let me know what you find out.
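
One way to test this suspicion is to reduce the number of jobs, so that far fewer result files and database entries are created. The sketch below, offered as an assumption rather than a confirmed fix, groups the 10,000 inputs into 100 chunks and lets each job process a whole chunk. `chunk.size` and `chunk_fun` are illustrative names, `FUN` is a cheap stand-in for the one in the thread, and `lapply` stands in for `bplapply` so that the example runs without a cluster.

```r
# Stand-in for the FUN in the thread; Sys.info() avoids spawning
# 10,000 "hostname" processes in this local demonstration
FUN <- function(i) Sys.info()[["nodename"]]

X <- 1:10000
chunk.size <- 100   # hypothetical value; tune to the cluster and job length

# Split the inputs into consecutive chunks of chunk.size elements
chunks <- split(X, ceiling(seq_along(X) / chunk.size))

# Each job handles a whole chunk and returns one result per element
chunk_fun <- function(idx) vapply(idx, FUN, character(1))

# On the cluster this call would be bplapply(chunks, chunk_fun)
xx <- lapply(chunks, chunk_fun)

print(length(chunks))        # number of jobs that would be submitted
print(table(unlist(xx)))     # same summary as in the original script
```

With this grouping only 100 jobs are registered instead of 10,000, while `unlist(xx)` still holds one hostname per original input.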

written 3.5 years ago by Martin Morgan

Hi Martin,

I think you're right. In fact, I'm already going down that path. Your code will be a big help in ironing this out. I'll post back here when I get something working.

-todd

written 3.5 years ago by Todd Creasy