Using DESeq2 v1.22.2 with Bioconductor version 3.8 and R 3.5.1, the following command consistently completes in fewer than 3 minutes:
numWorkers <- 20
dds <- DESeq(dds, parallel=TRUE, BPPARAM=MulticoreParam(numWorkers))
dds
is a 23,680 x 272 DESeqDataSet derived from a subset of TCGA HTSeq count matrices.
Using DESeq2 v1.24.0 with Bioconductor version 3.9 and R 3.6.0, the same command (on the same dataset) gets hung at the "gene-wise dispersion estimates: 20 workers" step. On my HPC, R/3.6.0 requires openmpi/3.1.4, so I think that was part of the problem.
Using DESeq2 v1.26.0 with Bioconductor version 3.10 and R 3.6.1 (without openmpi/3.1.4), the same command (on the same dataset) has erratic behavior; generally, the command gets hung at the "gene-wise dispersion estimates: 20 workers" step. If I provide a row-limited count matrix to the command (e.g. with, for example, dds[1:3000,]
), the command usually runs to completion but sometimes gets hung at the "final dispersion estimates, fitting model and testing: 20 workers" step. The memory usage is also erratic, sometimes completing the command with minimal memory (e.g. 3GB) and other times requiring >50GB.
My question is this: has something changed in recent versions of Bioconductor or DESeq2 that would help explain the difference in performance from DESeq2 v1.22.2 with Bioconductor version 3.8 and R 3.5.1?
I would like to troubleshoot this issue on my own, but I lack the expertise to debug this problem, so I am looking for guidance on how to proceed. I am aware of this reply https://support.bioconductor.org/p/107481/, and I'm happy to use limma-voom or edgeR (or my reliable outdated version of DESeq2), but I'm interested in the performance change from the prior version of DESeq2.
I simulate my data with the following toy count matrix, and the process completes using both DESeq2 v1.22.2 in R/3.5.1 and DESeq2 v1.26.0 in R/3.6.1, although it did hang indefinitely with DESeq2 v1.24.0 when running in R/3.6.0 (v1.26.0 in R/3.6.0 runs with an error and does not hang, as shown below):
## load the library in question
library(DESeq2)
## set toy size (i.e. for the matrix to be created)
sample_size <- 30
gene_number <- 1000
## create a toy count matrix
sample_names <- paste(rep("sample"),1:sample_size,sep="_")
toy_counts <- MASS::rnegbin(gene_number, theta=1)
for (i in 1:(length(sample_names)-1)) {
new_toy_counts <- MASS::rnegbin(gene_number, theta=1)
toy_counts <- cbind(toy_counts, new_toy_counts)
}
toy_count_matrix <- data.frame(toy_counts)
colnames(toy_count_matrix) <- sample_names
rownames(toy_count_matrix) <- paste(rep("gene_id"),1:nrow(toy_count_matrix),sep="_")
## create a toy annotation
toy_col_data <- data.frame(sample_name=sample_names,category=c(rep("A",length(sample_names)/2),rep("B",length(sample_names)/2)))
## create a toy dds
dds <- DESeqDataSetFromMatrix(toy_count_matrix, toy_col_data, ~category)
## test execution time
system.time(DESeq(dds, parallel=TRUE, BPPARAM=MulticoreParam(4)))
Here's the output from the last command with R/3.5.1:
estimating size factors
estimating dispersions
gene-wise dispersion estimates: 4 workers
mean-dispersion relationship
-- note: fitType='parametric', but the dispersion trend was not well captured by the
function: y = a/x + b, and a local regression fit was automatically substituted.
specify fitType='local' or 'mean' to avoid this message next time.
final dispersion estimates, fitting model and testing: 4 workers
-- replacing outliers and refitting for 1 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
user system elapsed
4.009 1.391 5.408
Here's my session info for R/3.5.1:
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /sfs/applications/201912/software/standard/compiler/gcc/7.1.0/R/3.5.1/lib64/R/lib/libRblas.so
LAPACK: /sfs/applications/201912/software/standard/compiler/gcc/7.1.0/R/3.5.1/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DESeq2_1.22.2 SummarizedExperiment_1.12.0
[3] DelayedArray_0.8.0 BiocParallel_1.16.6
[5] matrixStats_0.55.0 Biobase_2.42.0
[7] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[9] IRanges_2.16.0 S4Vectors_0.20.1
[11] BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] bit64_0.9-7 splines_3.5.1 Formula_1.2-3
[4] assertthat_0.2.1 latticeExtra_0.6-28 blob_1.2.0
[7] GenomeInfoDbData_1.2.0 pillar_1.4.2 RSQLite_2.1.2
[10] backports_1.1.5 lattice_0.20-38 glue_1.3.1
[13] digest_0.6.22 RColorBrewer_1.1-2 XVector_0.22.0
[16] checkmate_1.9.4 colorspace_1.4-1 htmltools_0.4.0
[19] Matrix_1.2-17 XML_3.98-1.20 pkgconfig_2.0.3
[22] genefilter_1.64.0 zlibbioc_1.28.0 purrr_0.3.3
[25] xtable_1.8-4 scales_1.0.0 htmlTable_1.13.2
[28] tibble_2.1.3 annotate_1.60.1 ggplot2_3.2.1
[31] nnet_7.3-12 lazyeval_0.2.2 survival_2.44-1.1
[34] magrittr_1.5 crayon_1.3.4 memoise_1.1.0
[37] MASS_7.3-51.4 foreign_0.8-72 tools_3.5.1
[40] data.table_1.12.6 stringr_1.4.0 locfit_1.5-9.1
[43] munsell_0.5.0 cluster_2.1.0 AnnotationDbi_1.44.0
[46] compiler_3.5.1 rlang_0.4.1 grid_3.5.1
[49] RCurl_1.95-4.12 rstudioapi_0.10 htmlwidgets_1.5.1
[52] bitops_1.0-6 base64enc_0.1-3 gtable_0.3.0
[55] DBI_1.0.0 R6_2.4.0 gridExtra_2.3
[58] knitr_1.25 dplyr_0.8.3 bit_1.1-14
[61] zeallot_0.1.0 Hmisc_4.2-0 stringi_1.4.3
[64] Rcpp_1.0.2 vctrs_0.2.0 geneplotter_1.60.0
[67] rpart_4.1-15 acepack_1.4.1 tidyselect_0.2.5
[70] xfun_0.10
Here's the output from the last command with R/3.6.0:
estimating size factors
estimating dispersions
gene-wise dispersion estimates: 4 workers
Error in mcexit(0L) : ignoring SIGPIPE signal
Error in mcexit(0L) : ignoring SIGPIPE signal
Error in mcexit(0L) : ignoring SIGPIPE signal
mean-dispersion relationship
final dispersion estimates, fitting model and testing: 4 workers
Error in mcexit(0L) : ignoring SIGPIPE signal
Error in mcexit(0L) : ignoring SIGPIPE signal
Error in mcexit(0L) : ignoring SIGPIPE signal
Error in mcexit(0L) : ignoring SIGPIPE signal
user system elapsed
5.612 1.453 7.072
Here's my session info for R/3.6.0:
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /sfs/applications/201912/software/standard/compiler/gcc/7.1.0/openblas/0.2.19/lib/libopenblas_sandybridgep-r0.2.19.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DESeq2_1.26.0 SummarizedExperiment_1.16.1
[3] DelayedArray_0.12.2 BiocParallel_1.20.1
[5] matrixStats_0.55.0 Biobase_2.46.0
[7] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0
[9] IRanges_2.20.2 S4Vectors_0.24.3
[11] BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] bit64_0.9-7 splines_3.6.0 Formula_1.2-3
[4] assertthat_0.2.1 latticeExtra_0.6-29 blob_1.2.1
[7] GenomeInfoDbData_1.2.2 pillar_1.4.3 RSQLite_2.2.0
[10] backports_1.1.5 lattice_0.20-38 glue_1.3.1
[13] digest_0.6.23 RColorBrewer_1.1-2 XVector_0.26.0
[16] checkmate_1.9.4 colorspace_1.4-1 htmltools_0.4.0
[19] Matrix_1.2-18 XML_3.99-0.3 pkgconfig_2.0.3
[22] genefilter_1.68.0 zlibbioc_1.32.0 purrr_0.3.3
[25] xtable_1.8-4 scales_1.1.0 jpeg_0.1-8.1
[28] htmlTable_1.13.3 tibble_2.1.3 annotate_1.64.0
[31] ggplot2_3.2.1 nnet_7.3-12 lazyeval_0.2.2
[34] survival_3.1-8 magrittr_1.5 crayon_1.3.4
[37] memoise_1.1.0 MASS_7.3-51.5 foreign_0.8-75
[40] tools_3.6.0 data.table_1.12.8 lifecycle_0.1.0
[43] stringr_1.4.0 locfit_1.5-9.1 munsell_0.5.0
[46] cluster_2.1.0 AnnotationDbi_1.48.0 compiler_3.6.0
[49] rlang_0.4.2 grid_3.6.0 RCurl_1.98-1.1
[52] rstudioapi_0.10 htmlwidgets_1.5.1 bitops_1.0-6
[55] base64enc_0.1-3 gtable_0.3.0 DBI_1.1.0
[58] R6_2.4.1 gridExtra_2.3 knitr_1.27
[61] dplyr_0.8.3 zeallot_0.1.0 bit_1.1-15.1
[64] Hmisc_4.3-0 stringi_1.4.5 Rcpp_1.0.3
[67] geneplotter_1.64.0 vctrs_0.2.1 rpart_4.1-15
[70] acepack_1.4.1 png_0.1-7 tidyselect_0.2.5
[73] xfun_0.12
Here's the output from the last command with R/3.6.1:
estimating size factors
estimating dispersions
gene-wise dispersion estimates: 4 workers
mean-dispersion relationship
-- note: fitType='parametric', but the dispersion trend was not well captured by the
function: y = a/x + b, and a local regression fit was automatically substituted.
specify fitType='local' or 'mean' to avoid this message next time.
final dispersion estimates, fitting model and testing: 4 workers
-- replacing outliers and refitting for 1 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
user system elapsed
7.400 1.502 8.910
Here's my session info for R/3.6.1:
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /sfs/applications/201912/software/standard/compiler/gcc/7.1.0/openblas/0.2.19/lib/libopenblas_sandybridgep-r0.2.19.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DESeq2_1.26.0 SummarizedExperiment_1.16.1
[3] DelayedArray_0.12.2 BiocParallel_1.20.1
[5] matrixStats_0.55.0 Biobase_2.46.0
[7] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0
[9] IRanges_2.20.2 S4Vectors_0.24.3
[11] BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] bit64_0.9-7 splines_3.6.1 Formula_1.2-3
[4] assertthat_0.2.1 latticeExtra_0.6-29 blob_1.2.1
[7] GenomeInfoDbData_1.2.2 pillar_1.4.3 RSQLite_2.2.0
[10] backports_1.1.5 lattice_0.20-38 glue_1.3.1
[13] digest_0.6.23 RColorBrewer_1.1-2 XVector_0.26.0
[16] checkmate_1.9.4 colorspace_1.4-1 htmltools_0.4.0
[19] Matrix_1.2-18 XML_3.99-0.3 pkgconfig_2.0.3
[22] genefilter_1.68.0 zlibbioc_1.32.0 purrr_0.3.3
[25] xtable_1.8-4 scales_1.1.0 jpeg_0.1-8.1
[28] htmlTable_1.13.3 tibble_2.1.3 annotate_1.64.0
[31] ggplot2_3.2.1 nnet_7.3-12 lazyeval_0.2.2
[34] survival_3.1-8 magrittr_1.5 crayon_1.3.4
[37] memoise_1.1.0 MASS_7.3-51.5 foreign_0.8-75
[40] tools_3.6.1 data.table_1.12.8 lifecycle_0.1.0
[43] stringr_1.4.0 locfit_1.5-9.1 munsell_0.5.0
[46] cluster_2.1.0 AnnotationDbi_1.48.0 compiler_3.6.1
[49] rlang_0.4.2 grid_3.6.1 RCurl_1.98-1.1
[52] rstudioapi_0.10 htmlwidgets_1.5.1 bitops_1.0-6
[55] base64enc_0.1-3 gtable_0.3.0 DBI_1.1.0
[58] R6_2.4.1 gridExtra_2.3 knitr_1.27
[61] dplyr_0.8.3 zeallot_0.1.0 bit_1.1-15.1
[64] Hmisc_4.3-0 stringi_1.4.5 Rcpp_1.0.3
[67] geneplotter_1.64.0 vctrs_0.2.1 rpart_4.1-15
[70] acepack_1.4.1 png_0.1-7 tidyselect_0.2.5
[73] xfun_0.12
On my machine, 2,000 genes with 2 workers takes 7.7 seconds, and 3,000 genes with 3 workers takes 8.6 seconds. This is typical that you wouldn't see linear decrease with number of workers, but that there is a speedup with parallelization nonetheless. You may want to be careful with setting up to 20 cores if you are working on a cluster where the cores may not be on the same node, this can substantially hurt performance, where you'd be better off without any parallelization at all. I've seen this happen multiple times from users on the support forum in fact.
Based on SLURM output, my parallel jobs are running on a single node. Nevertheless, additional trouble-shooting today with parallelization proved that v1.26.0 is indeed substantially faster than v1.22.2 when processing is restricted to a single core. v1.26.0 processed a toy dataset in a fraction of the time it took v1.22.2 to process the same dataset. In fact, v1.26.0 processed the dataset on a single core as fast as v1.22.2 processed the data on 10 cores. The problem I'm having is clearly with the parallelization, as v1.26.0 timed out when run on multiple cores, so I wish I could rename and retag my post to attract feedback on BiocParallel instead of DESeq2. Thanks again for your comments on parallelization.
Simplify the problem to make sure basics are ok, e..g,
bplapply(1:4, sqrt, BPPARAM=MulticoreParam())
. Remember that the cost of distributing data / retrieving results can be significant, and can outweigh gains if the computational 'work' done on core is small.Thank you for your prompt reply. I'm sorry for questioning the DESeq2 development team. I agree it's most likely the parallel backend. Running your sample code on my frontend works fine. The HPC team at my institution didn't have any promising leads, but I'm going to experiment more with different parallel arrangements on the backend. I'm satisfied this is most likely NOT a DESeq2 performance issue. Thank you for providing such a great package.