Hi,
I'm writing some code to submit jobs to a Torque cluster using BiocParallel. However, long-running jobs do not seem to "wait" and eventually throw an error (listed below). bplapply *does* submit to the cluster, and I can see the jobs via qstat and running on the node. It looks like there is some kind of timeout, and the "waiting" step does not wait long enough. I'm running an example function, modified to run about 10k times, while I try to debug this. Any ideas?
It appears the timeout is hard-coded for Torque in BatchJobsParam and there's no way for me to change that.
Code and error below:
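library("BatchJobs")
library("BiocParallel")  # provides BatchJobsParam(), register(), bplapply()

# x is defined elsewhere in my script; x@templatefile is passed as the Torque template file
torque.functions <- makeClusterFunctionsTorque(
    x@templatefile,
    list.jobs.cmd=c("qstat")
)
bpparam <- BatchJobsParam(workers="2",
                          jobname="test",
                          resources=list(nodes="4:ppn=2", vmem="20gb"),
                          cluster.functions=torque.functions,
                          progressbar=TRUE)
register(bpparam)

# Each job just reports the host it ran on
FUN <- function(i) system("hostname", intern=TRUE)
xx <- bplapply(1:10000, FUN)
print("RESULTS:")
print(table(unlist(xx)))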
SubmitJobs |+++++++++++++++++++++++++++++++++++++++++++++++++| 100% (00:00:00)
Waiting [S:10000 D:0 E:0 R:0] |+ | 0% (00:00:00)
Error in getResults(reg, ids, part, missing.ok) :
  Some job result files do not exist, showing up to first 10:
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/43/143-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/45/145-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/52/152-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/54/154-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp/BiocParallel_tmp_12c351d5a1510/jobs/56/156-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/58/158-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/59/159-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/60/160-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/61/161-result.RData
/Biomarker/ngs/software/ngs_utils/ngs_utils/R/temp//BiocParallel_tmp_12c351d5a1510/jobs/62/162-result.RData
Calls: main ... bplapply -> bplapply -> loadResults -> getResults -> stopf
Execution halted
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] ggplot2_2.0.0              DESeq2_1.10.1
 [3] RcppArmadillo_0.6.500.4.0  Rcpp_0.12.3
 [5] pheatmap_1.0.8             RColorBrewer_1.1-2
 [7] BiocParallel_1.4.3         GenomicAlignments_1.6.3
 [9] Rsamtools_1.22.0           Biostrings_2.38.3
[11] XVector_0.10.0             SummarizedExperiment_1.0.2
[13] biomaRt_2.26.1             GenomicFeatures_1.22.12
[15] AnnotationDbi_1.32.3       Biobase_2.30.0
[17] GenomicRanges_1.22.4       GenomeInfoDb_1.6.3
[19] IRanges_2.4.6              S4Vectors_0.8.11
[21] BiocGenerics_0.16.1        BatchJobs_1.6
[23] BBmisc_1.9

loaded via a namespace (and not attached):
 [1] genefilter_1.52.1    locfit_1.5-9.1       splines_3.2.2
 [4] lattice_0.20-33      colorspace_1.2-6     rtracklayer_1.30.2
 [7] base64enc_0.1-3      XML_3.98-1.3         survival_2.38-3
[10] foreign_0.8-66       DBI_0.3.1            lambda.r_1.1.7
[13] plyr_1.8.3           stringr_1.0.0        zlibbioc_1.16.0
[16] munsell_0.4.2        gtable_0.1.2         futile.logger_1.4.1
[19] latticeExtra_0.6-26  geneplotter_1.48.0   acepack_1.3-3.3
[22] xtable_1.8-2         scales_0.3.0         checkmate_1.7.1
[25] Hmisc_3.17-1         annotate_1.48.0      sendmailR_1.2-1
[28] gridExtra_2.0.0      brew_1.0-6           fail_1.3
[31] digest_0.6.9         stringi_1.0-1        grid_3.2.2
[34] tools_3.2.2          bitops_1.0-6         magrittr_1.5
[37] RCurl_1.95-4.7       RSQLite_1.0.0        Formula_1.2-1
[40] cluster_2.0.3        futile.options_1.0.0 rpart_4.1-10
[43] nnet_7.3-12
Hi Todd -- are you sure that this is a timeout / error with BiocParallel? There is a timeout in your version of BiocParallel, but it is
timeout=Inf
which is a pretty long time (in the devel version, the timeout is 30 days, which is also a long time)! To debug, it might help to take BiocParallel out of the picture and use BatchJobs more-or-less directly.

My suspicion is that the large number of small jobs overwhelms either the file system or database, requiring more careful management from BatchJobs. Please let me know what you find out.
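For anyone following along, here is a minimal sketch of what running the same test through BatchJobs directly might look like. It is an illustration only, not the code from the reply above. The helper name run_direct and its arguments are made up for this example, and it assumes the registry directory sits on a file system shared with the compute nodes. Jobs are submitted in chunks so the scheduler and file system see far fewer individual submissions, and the job status is printed before any results are loaded.

```r
# Hypothetical helper (not part of BatchJobs or BiocParallel): runs the toy
# "hostname" workload through BatchJobs directly so each stage can be inspected.
run_direct <- function(n, cluster.functions, resources = list(),
                       chunk.size = 100L,
                       file.dir = tempfile("direct_test_", tmpdir = getwd())) {
  # Use the given scheduler backend for this R session.
  BatchJobs::setConfig(cluster.functions = cluster.functions)

  # The registry stores job definitions, logs and results under file.dir.
  reg <- BatchJobs::makeRegistry(id = "direct_test", file.dir = file.dir)

  # One job per input value, same function as in the bplapply() call.
  ids <- BatchJobs::batchMap(reg,
                             function(i) system("hostname", intern = TRUE),
                             seq_len(n))

  # Group job ids into chunks; each chunk is submitted as a single cluster job.
  chunked <- BBmisc::chunk(ids, chunk.size = chunk.size)
  BatchJobs::submitJobs(reg, chunked, resources = resources)

  # Block until the jobs terminate; FALSE means not all finished successfully.
  finished <- BatchJobs::waitForJobs(reg)
  BatchJobs::showStatus(reg)

  if (!finished) {
    # Report what went wrong instead of failing later inside loadResults().
    message("errors: ", length(BatchJobs::findErrors(reg)),
            ", expired: ", length(BatchJobs::findExpired(reg)))
    return(invisible(reg))
  }
  BatchJobs::loadResults(reg)
}

# Possible usage with the objects from the original post (needs a real cluster):
# res <- run_direct(10000, torque.functions,
#                   resources = list(nodes = "4:ppn=2", vmem = "20gb"))
# table(unlist(res))
```

If the run fails, the registry is returned, so the log files of individual jobs can be inspected from the same session.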
Hi Martin,
I think you're right. In fact, I'm already going down that path. Your code will be a big help in ironing this out. I'll post back here when I get something working.
-todd