Question

Can bplapply timeout when waiting for long jobs?

Todd Creasy • 0

Hi,

I'm writing some code that submits jobs to a Torque cluster using BiocParallel. However, long-running jobs do not seem to "wait" and eventually throw an error (listed below). bplapply *does* submit to the cluster, and I can see the jobs via qstat and running on the node. It looks as though there is some kind of timeout, so that the "waiting" step does not wait long enough. While debugging, I've been running an example function that I modified to run about 10k times. Any ideas?

It appears the timeout is hard-coded for Torque in BatchJobsParam and there's no way for me to change that.

Code and error below:

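    library("BiocParallel")  # provides BatchJobsParam(), register(), bplapply()
    library("BatchJobs")     # provides makeClusterFunctionsTorque()

    # Torque cluster functions built from my template file
    torque.functions <- makeClusterFunctionsTorque(
        x@templatefile,
        list.jobs.cmd=c("qstat")
    )

    bpparam <- BatchJobsParam(workers="2",
                            jobname="test",
                            resources=list(nodes="4:ppn=2", vmem="20gb"),
                            cluster.functions=torque.functions,
                            progressbar=TRUE)
    register(bpparam)

    # Each job just reports the host it ran on
    FUN <- function(i) system("hostname", intern=TRUE)
    xx <- bplapply(1:10000, FUN)

    print("RESULTS:")
    print(table(unlist(xx)))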

SubmitJobs |+++++++++++++++++++++++++++++++++++++++++++++++++| 100% (00:00:00)
Waiting [S:10000 D:0 E:0 R:0] |+                                 |   0% (00:00:00)
Error in getResults(reg, ids, part, missing.ok) :
  Some job result files do not exist, showing up to first 10:
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/43/143-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/45/145-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/52/152-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/54/154-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/56/156-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/58/158-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/59/159-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/60/160-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/61/161-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/62/162-result.RData
Calls: main ... bplapply -> bplapply -> loadResults -> getResults -> stopf
Execution halted

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] ggplot2_2.0.0              DESeq2_1.10.1
 [3] RcppArmadillo_0.6.500.4.0  Rcpp_0.12.3
 [5] pheatmap_1.0.8             RColorBrewer_1.1-2
 [7] BiocParallel_1.4.3         GenomicAlignments_1.6.3
 [9] Rsamtools_1.22.0           Biostrings_2.38.3
[11] XVector_0.10.0             SummarizedExperiment_1.0.2
[13] biomaRt_2.26.1             GenomicFeatures_1.22.12
[15] AnnotationDbi_1.32.3       Biobase_2.30.0
[17] GenomicRanges_1.22.4       GenomeInfoDb_1.6.3
[19] IRanges_2.4.6              S4Vectors_0.8.11
[21] BiocGenerics_0.16.1        BatchJobs_1.6
[23] BBmisc_1.9

loaded via a namespace (and not attached):
 [1] genefilter_1.52.1    locfit_1.5-9.1       splines_3.2.2
 [4] lattice_0.20-33      colorspace_1.2-6     rtracklayer_1.30.2
 [7] base64enc_0.1-3      XML_3.98-1.3         survival_2.38-3
[10] foreign_0.8-66       DBI_0.3.1            lambda.r_1.1.7
[13] plyr_1.8.3           stringr_1.0.0        zlibbioc_1.16.0
[16] munsell_0.4.2        gtable_0.1.2         futile.logger_1.4.1
[19] latticeExtra_0.6-26  geneplotter_1.48.0   acepack_1.3-3.3
[22] xtable_1.8-2         scales_0.3.0         checkmate_1.7.1
[25] Hmisc_3.17-1         annotate_1.48.0      sendmailR_1.2-1
[28] gridExtra_2.0.0      brew_1.0-6           fail_1.3
[31] digest_0.6.9         stringi_1.0-1        grid_3.2.2
[34] tools_3.2.2          bitops_1.0-6         magrittr_1.5
[37] RCurl_1.95-4.7       RSQLite_1.0.0        Formula_1.2-1
[40] cluster_2.0.3        futile.options_1.0.0 rpart_4.1-10
[43] nnet_7.3-12

updated 8.2 years ago by Martin Morgan 25k • written 8.2 years ago by Todd Creasy • 0

Hi Todd -- are you sure that this is a timeout / error with BiocParallel? There is a timeout in your version of BiocParallel, but it is timeout=Inf, which is a pretty long time (in the devel version the timeout is 30 days, which is also a long time)! To debug, it might help to take BiocParallel out of the picture and use BatchJobs more or less directly, along the lines of:

    library(BiocParallel)
    library(BatchJobs)
    library(BBmisc)

    FUN <- function(i) system("hostname", intern=TRUE)
    X <- 1:100

    # torque.functions as defined in your code above
    bpparam <- BatchJobsParam(workers="2", jobname="test",
        resources=list(nodes="4:ppn=2", vmem="20gb"), cluster.functions=torque.functions,
        progressbar=TRUE)

    # Drive BatchJobs directly, using the same configuration BiocParallel would use
    setConfig(conf = bpparam$conf.pars)
    reg <- makeRegistry(id = "test", file.dir = tempfile(), seed = 123)
    ids <- batchMap(reg, FUN, X)
    cids <- BBmisc::chunk(ids, n.chunks = bpworkers(bpparam), shuffle = TRUE)
    submitJobs(reg, cids, resources=bpparam$resources)
    waitForJobs(reg, ids, stop.on.error=bpstopOnError(bpparam))
    res <- loadResults(reg, ids, use.names = "none")
    table(unlist(res))

My suspicion is that the large number of small jobs overwhelms either the file system or database, requiring more careful management from BatchJobs. Please let me know what you find out.
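If the number of small jobs does turn out to be the problem, one thing that might be worth trying is to have each job handle a block of indices rather than a single one. The following is only a minimal base-R sketch of that idea, not something BiocParallel or BatchJobs provides: `chunk_inputs` and `CHUNK_FUN` are hypothetical names, the choice of 100 chunks is arbitrary, and `Sys.info()` stands in for `system("hostname")` so that the dry run does not spawn processes.

```r
# Hypothetical helper (not part of BiocParallel or BatchJobs):
# split x into at most n_chunks contiguous blocks of near-equal size.
chunk_inputs <- function(x, n_chunks) {
    stopifnot(n_chunks >= 1)
    n_chunks <- min(n_chunks, length(x))
    group <- ceiling(seq_along(x) * n_chunks / length(x))
    unname(split(x, group))
}

# Stand-in for the thread's FUN; reports the host without a system() call.
FUN <- function(i) Sys.info()[["nodename"]]

# One job per block: each job loops over its share of the inputs.
CHUNK_FUN <- function(idx) unlist(lapply(idx, FUN))

X <- 1:10000
chunks <- chunk_inputs(X, n_chunks = 100)
length(chunks)  # number of jobs that would be submitted

# Local dry run on a small input, with lapply() standing in for bplapply().
small <- chunk_inputs(1:6, n_chunks = 3)
res <- lapply(small, CHUNK_FUN)
print(table(unlist(res)))
```

On the cluster, something like `bplapply(chunks, CHUNK_FUN)` would take the place of the local `lapply()`, so the scheduler and registry would see one job per block instead of one per index. Whether that actually avoids the missing result files here is untested.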

8.2 years ago by Martin Morgan 25k

Hi Martin,

I think you're right. In fact, I'm already going down that path. Your code will be a big help in ironing this out. I'll post back here when I get something working.

-todd

8.2 years ago by Todd Creasy • 0