Dear DESeq2 Development Team,
In the last days I realized that the DESeq() function seems to create multiple threads for the R process. I thought this might be dependent on the parameter flag "parallel" and its setting, but it still continues to do this while parallel = FALSE. Even though this is not a serious bug, because the threads stay on listening and dont use up any computation time whatsoever, there are some things which bug me and I hope you can shed some light upon these:
1.) Sometimes these threads of the process don't shut down properly if the parent process is closed, which leads to an accumulation of lost threads.
2.) With htop it seems that each process receives full number of workers as threads even if parallel=FALSE, which seems a little odd to me. Is this intended? If I set the MulticoreParam as default and on lets say 4 workers upfront, it uses this number. For me this totally makes sense in an parallel = TRUE scenario.
3.) When testing parallel=T vs parallel=F I realized that parallel T stops doing anything after
"gene-wise dispersion estimates: 6 workers". All the threads are in Sleep.
Here is the test case I used:
library("airway") library(DESeq2)
data("airway") se <- airway
dds <- DESeqDataSet(se, design = ~ cell + dex) dds<-DESeq(dds,parallel=F)
As well as the MulticoreParam settings as well as the output of bpparam():
> register(MulticoreParam(6,log=TRUE),default=TRUE)
> bpparam()
class: MulticoreParam
bpisup: FALSE; bpworkers: 6; bptasks: 0; bpjobname: BPJOB
bplog: TRUE; bpthreshold: INFO; bpstopOnError: TRUE
bptimeout: 2592000; bpprogressbar: FALSE
bpRNGseed:
bplogdir: NA
bpresultdir: NA
cluster type: FORK
sessionInfo() R version 3.3.0 (2016-05-03) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Debian GNU/Linux 8 (jessie) locale: [1] C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] BiocParallel_1.6.0 DESeq2_1.12.0 [3] airway_0.105.0 SummarizedExperiment_1.2.0 [5] Biobase_2.32.0 GenomicRanges_1.24.0 [7] GenomeInfoDb_1.8.0 IRanges_2.6.0 [9] S4Vectors_0.10.0 BiocGenerics_0.18.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.4.5 RColorBrewer_1.1-2 plyr_1.8.3 [4] XVector_0.12.0 tools_3.3.0 zlibbioc_1.18.0 [7] rpart_4.1-10 RSQLite_1.0.0 annotate_1.50.0 [10] gtable_0.2.0 lattice_0.20-33 Matrix_1.2-6 [13] DBI_0.4 gridExtra_2.2.1 genefilter_1.54.0 [16] cluster_2.0.4 locfit_1.5-9.1 grid_3.3.0 [19] nnet_7.3-12 data.table_1.9.6 AnnotationDbi_1.34.0 [22] XML_3.98-1.4 survival_2.39-2 foreign_0.8-66 [25] latticeExtra_0.6-28 Formula_1.2-1 geneplotter_1.50.0 [28] ggplot2_2.1.0 Hmisc_3.17-4 scales_0.4.0 [31] splines_3.3.0 colorspace_1.2-6 xtable_1.8-2 [34] acepack_1.3-3.3 munsell_0.4.3 chron_2.3-47
Greetings Frank
Actually in my first test I didn't used register() only later to make sure i wont eat up all resources on our computing server. Is register called by default maybe? If i call the airway example with parallel=F and without registering upfront, it will allocate anyway, as if I called register. So it seems they are allocated anyway, but just dont get jobs if parallel=F.
Thank you for adding the BiocParallel tag, I forgot that in the beginning.
Maybe you have redefined
F = TRUE
;) (the best practice is to use the full names FALSE, TRUE which are immutable symbols, otherwise you have a variable named 'F', and variables can be redefined...). But trying to be helpful, I don't see any of these behaviors on the release branch. However, current packages are generally more recent than you indicate --BiocInstaller::biocValid()
and thebiocLite()
will update.register(SerialParam())
will make BiocParallel code run serially, as a work-around?Eventhough I made sure to run this minimal example from above in a clean R session, I reran the case i reported with FALSE instead of F and also with register on SerialParam():
Still htop shows multiple threads on the exec/R process. They start and instantly go to sleep. So yes this does not perform parallel actions.. but this also was the case when I used explicit parallel=FALSE in the DESeq call.
So my problem is, that I dont understand, why there are threads created eventhough they are not needed at all. Seems pretty counterintuitive to me, but I hope getting enlighted here.
This is nothing major I guess, because no additional resources are used, but again the multi threads seem to somehow impact the scheduler. So for example consider:
Computing server with 32 cores and 16 cores free for work. It seems that the created but not used threads seem to impact the scheduling of the process. It takes a lot longer than in situations where there are no cores used by others at all. Also RAM isn't an issue. Naturally I would assume, that if there are 16 Cores available and I only instruct using serial, that this process atleast gets some attention instead of staying 30-40% of the time in sleep.
Anyway I just wanted to make sure, that its not a possible inconsistency. As you state this example above is not recreatable for you, I assume that the error is on my side (server or clientwise). Did you checked the threads with htop?
Yes, I used htop. Did you update your packages? You must updated your R packages with biocLite(), to their current release version, otherwise we run the risk of spending a lot of time finding a problem that has already been fixed. Also, if I or others can't reproduce your problem then it'll have to be on you to make enough progress that we can reproduce the problem or otherwise shed some light. It's also important to be very precise, so . So...
After updating your packages, try minimal examples, e.g., does
have no consequence for the number of running threads (the answer should be 'yes')? Does
start 4 processes (answer should be no)? Does
run for approximately 5 seconds, with 4 sub processes started and cleaned up? Does
run for 20 seconds with a single thread?
If the above seem ok (the only variant that I am uncertain of is the clean-up of 'defunct' threads) then we can tentative rule out BiocParallel per se, and start at the other end with DESeq2. I would start by debugging the top-level function and confirming that the correct, non-parallel, code path is taken
debug(DESeq)
and step through the most surprising case (threads, when using
SerialParam()
)When you encounter something unexpected, try to figure out what is going wrong, eg.,
Browse[2]> parallel # is 'parallel' somehow not what is expected, e.g., a typo?
[1] FALSE # nope!
Browse[2]>
If you need to delve deeper, then typically stop debugging with
and start again with the next function
I'll make sure to test all this on the updated packages and come back to the thread when its done. Thank you very much for your suggestions and time :)
Just to add on to Martin's suggestions: I'd want to know if you see the processes spawned when you do:
library(BiocParallel)
Take a look at the BiocParallel vignette, which helps you investigate what the default registration is. As Martin said, it will be easier if we can separate the behavior you see (if it persists using the current release versions), from DESeq2, which is just using BiocParallel's bplapply() under the hood. With the current release I do not see this behavior when I load DESeq2 or run DESeq().