Question

DiffBind does not run in parallel

eggrandio • 0

@eggrandio-14403

Last seen 3.7 years ago

United States

Hi,

I am trying to run DiffBind in parallel, but it does not detect multiple cores, even though R itself reports them:

> parallel::detectCores()
[1] 8

Here is what I see when I try to run DiffBind on a test set:

> test = dba(sampleSheet = "TEST.xls")
wt_D_1 wt  D  1 bayes
wt_D_2 wt  D  2 bayes
> test.counts = dba.count(test, minOverlap=1)
Sample: 01_wt_D_1_BAM_MD_asBED.bed125 
Sample: 02_wt_D_2_BAM_MD_asBED.bed125 
Sample: Input_files/25_wt_D_INP_BAM_MD_asBED.bed125 
Warning message:
In dba.multicore.init(DBA$config) :
  Parallel execution unavailable: executing serially.

What should I do to run dba.count in parallel? It would reduce the analysis time a lot for me. If you need more information, or want me to run any other diagnostic command, please ask.

Thanks!

PS: Here is my sessionInfo() output in case it helps

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Spain.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] DiffBind_2.4.8 SummarizedExperiment_1.6.5 DelayedArray_0.2.7 matrixStats_0.52.2 Biobase_2.36.2
[6] GenomicRanges_1.28.6 GenomeInfoDb_1.12.3 IRanges_2.10.5 S4Vectors_0.14.7 BiocGenerics_0.22.1

loaded via a namespace (and not attached):
[1] edgeR_3.18.1 bit64_0.9-7 splines_3.4.0 gtools_3.5.0 assertthat_0.2.0
[6] latticeExtra_0.6-28 amap_0.8-14 RBGL_1.52.0 blob_1.1.0 GenomeInfoDbData_0.99.0
[11] Rsamtools_1.28.0 ggrepel_0.7.0 Category_2.42.1 pillar_1.1.0 RSQLite_2.0
[16] backports_1.1.2 lattice_0.20-35 glue_1.2.0 limma_3.32.10 digest_0.6.14
[21] RColorBrewer_1.1-2 XVector_0.16.0 checkmate_1.8.5 colorspace_1.3-2 Matrix_1.2-12
[26] plyr_1.8.4 GSEABase_1.38.2 XML_3.98-1.9 pkgconfig_2.0.1 pheatmap_1.0.8
[31] ShortRead_1.34.2 biomaRt_2.32.1 genefilter_1.58.1 zlibbioc_1.22.0 xtable_1.8-2
[36] GO.db_3.4.1 scales_0.5.0 brew_1.0-6 gdata_2.18.0 BiocParallel_1.10.1
[41] tibble_1.4.1 annotate_1.54.0 ggplot2_2.2.1 GenomicFeatures_1.28.5 lazyeval_0.2.1
[46] XLConnect_0.2-13 magrittr_1.5 survival_2.41-3 memoise_1.1.0 systemPipeR_1.10.2
[51] gplots_3.0.1 hwriter_1.3.2 GOstats_2.42.0 graph_1.54.0 tools_3.4.0
[56] data.table_1.10.4-3 BBmisc_1.11 sendmailR_1.2-1 munsell_0.4.3 locfit_1.5-9.1
[61] bindrcpp_0.2 AnnotationDbi_1.38.2 Biostrings_2.44.2 compiler_3.4.0 caTools_1.17.1
[66] rlang_0.1.6 grid_3.4.0 RCurl_1.95-4.10 rjson_0.2.15 AnnotationForge_1.18.2
[71] base64enc_0.1-3 bitops_1.0-6 gtable_0.2.0 DBI_0.7 R6_2.2.2
[76] GenomicAlignments_1.12.2 dplyr_0.7.4 rtracklayer_1.36.6 bit_1.1-12 bindr_0.1
[81] XLConnectJars_0.2-13 KernSmooth_2.23-15 rJava_0.9-9 stringi_1.1.6 BatchJobs_1.7
[86] Rcpp_0.12.14

diffbind parallel biocparallel • 3.3k views

ADD COMMENT • link updated 8.0 years ago by Rory Stark ★ 5.2k • written 8.0 years ago by eggrandio • 0

score 1 · Answer 1 · 2018-01-19

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 13 months ago

Cambridge, UK

DiffBind uses the "parallel" package (built in to R) to run parallel jobs. Unfortunately, the fork-based parallel execution it relies on is not supported on the Windows platform. You'd need to use Linux or macOS to run in parallel.
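As a rough illustration of the platform limitation (this is not DiffBind's own code, just a minimal sketch using the base "parallel" package), mclapply() only accepts mc.cores = 1 on Windows, so the number of workers has to be chosen from the operating system type. The value of 2 workers on other platforms is an arbitrary assumption for the example:

```r
library(parallel)

# Fork-based functions such as mclapply() cannot use more than one core
# on Windows, so fall back to a single worker there.
# The choice of 2 workers elsewhere is an arbitrary value for this sketch.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Square four numbers, in parallel where forking is available.
squares <- mclapply(1:4, function(i) i^2, mc.cores = n_cores)

print(unlist(squares))
```

On Windows this runs serially; on Linux or macOS the same call forks two worker processes.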

-Rory

ADD COMMENT • link 8.0 years ago Rory Stark ★ 5.2k

Thanks for the quick reply!

I do not know if it's appropriate, but instead of starting a new thread I wanted to ask you a follow-up question.

I have a large dataset (32 files), and when I run the dba.count command it sometimes skips some files and does not count their reads. I have run it several times, and every time the files that get "skipped" are different. I do not know if I'm running out of memory or what else could be causing this behavior. I have resorted to rerunning it until all the files are read, but that is very time-consuming.

This is an example of the message I obtain after running the dba.count :

Warning messages:
1: In dba.multicore.init(DBA$config) :
  Parallel execution unavailable: executing serially.
2: In DGEList(counts, lib.size = libsize, group = groups, genes = as.character(1:nrow(counts))) :
  library size of zero detected
3: In max(abs(logR)) : no non-missing arguments to max; returning -Inf

ADD REPLY • link 8.0 years ago eggrandio • 0

Hmmn. Since you are running serially anyway, it may be best to set bParallel=FALSE and see if that is any better.
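Something along these lines, reusing the sample sheet name and minOverlap value from the original question. This is only a sketch: it assumes DiffBind is installed and that "TEST.xls" is in the working directory, and it skips the run otherwise:

```r
# Placeholder so the object exists even when the run is skipped.
test.counts <- NULL

# Assumption: DiffBind is installed and the sample sheet from the question
# ("TEST.xls") is in the working directory; otherwise nothing is run.
if (requireNamespace("DiffBind", quietly = TRUE) && file.exists("TEST.xls")) {
  test <- DiffBind::dba(sampleSheet = "TEST.xls")
  # Explicitly request serial execution for the counting step.
  test.counts <- DiffBind::dba.count(test, minOverlap = 1, bParallel = FALSE)
}
```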

This is a tough one to debug remotely. One thing you could try if you get desperate is to set:

> debug(DiffBind:::pv.do_getCounts)

and then type "c" whenever it stops.

ADD REPLY • link 8.0 years ago Rory Stark ★ 5.2k

I tried it with bParallel=FALSE but I am still getting files skipped. Sometimes it's just one, sometimes it's 10 of them.

I am running the debug. What should I look for? Does it give an automated report at the end?

Alternatively, I could add a line so it stops after finding any library with size 0. At least that way I do not have to wait until it has processed all the files to know if it has read them.
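A minimal sketch of what I mean, in plain R. The helper name check_library_sizes is made up, the count matrix is toy data with invented numbers, and it assumes one named column per sample. It does not hook into dba.count itself, so it would have to be applied to whatever counts are available at that point:

```r
# Hypothetical helper: stop as soon as any sample (column) has a total
# read count of zero. Assumes the columns of `counts` are named.
check_library_sizes <- function(counts) {
  lib_sizes <- colSums(counts)
  empty <- names(lib_sizes)[lib_sizes == 0]
  if (length(empty) > 0) {
    stop("Library size of zero for: ", paste(empty, collapse = ", "))
  }
  invisible(lib_sizes)
}

# Toy count matrix (made-up numbers): the second sample has no reads.
toy_counts <- matrix(c(10, 0, 5, 0, 0, 0), nrow = 3,
                     dimnames = list(NULL, c("wt_D_1", "wt_D_2")))

# Capture the error message instead of aborting the session.
result <- tryCatch(check_library_sizes(toy_counts),
                   error = function(e) conditionMessage(e))
print(result)
```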

Thanks a lot for your help!

ADD REPLY • link 8.0 years ago eggrandio • 0

score 0 · Answer 2 · 2018-01-21

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 13 months ago

Cambridge, UK

No report. I was just thinking that stopping in the debugger would change the timing.

-R

ADD COMMENT • link 8.0 years ago Rory Stark ★ 5.2k