apparent DESeq2::DESeq() performance decline since v1.22.2
1
0
Entering edit mode
chd5n • 0
@chd5n-22749
Last seen 3.8 years ago

Using DESeq2 v1.22.2 with Bioconductor version 3.8 and R 3.5.1, the following command consistently completes in fewer than 3 minutes:

numWorkers <- 20
dds <- DESeq(dds, parallel=TRUE, BPPARAM=MulticoreParam(numWorkers))

dds is a 23,680 x 272 DESeqDataSet derived from a subset of TCGA HTSeq count matrices.

Using DESeq2 v1.24.0 with Bioconductor version 3.9 and R 3.6.0, the same command (on the same dataset) gets hung at the "gene-wise dispersion estimates: 20 workers" step. On my HPC, R/3.6.0 requires openmpi/3.1.4, so I think that was part of the problem.

Using DESeq2 v1.26.0 with Bioconductor version 3.10 and R 3.6.1 (without openmpi/3.1.4), the same command (on the same dataset) has erratic behavior; generally, the command gets hung at the "gene-wise dispersion estimates: 20 workers" step. If I provide a row-limited count matrix to the command (e.g. with, for example, dds[1:3000,]), the command usually runs to completion but sometimes gets hung at the "final dispersion estimates, fitting model and testing: 20 workers" step. The memory usage is also erratic, sometimes completing the command with minimal memory (e.g. 3GB) and other times requiring >50GB.

My question is this: has something changed in recent versions of Bioconductor or DESeq2 that would help explain the difference in performance from DESeq2 v1.22.2 with Bioconductor version 3.8 and R 3.5.1?

I would like to troubleshoot this issue on my own, but I lack the expertise to debug this problem, so I am looking for guidance on how to proceed. I am aware of this reply https://support.bioconductor.org/p/107481/, and I'm happy to use limma-voom or edgeR (or my reliable outdated version of DESeq2), but I'm interested in the performance change from the prior version of DESeq2.

I simulate my data with the following toy count matrix, and the process completes using both DESeq2 v1.22.2 in R/3.5.1 and DESeq2 v1.26.0 in R/3.6.1, although it did hang indefinitely with DESeq2 v1.24.0 when running in R/3.6.0 (v1.26.0 in R/3.6.0 runs with an error and does not hang, as shown below):

## load the library in question
library(DESeq2)

## set toy size (i.e. for the matrix to be created)
sample_size <- 30
gene_number <- 1000

## create a toy count matrix
sample_names <- paste(rep("sample"),1:sample_size,sep="_")
toy_counts <- MASS::rnegbin(gene_number, theta=1)
for (i in 1:(length(sample_names)-1)) {
  new_toy_counts <- MASS::rnegbin(gene_number, theta=1)
  toy_counts <- cbind(toy_counts, new_toy_counts)
}
toy_count_matrix <- data.frame(toy_counts)
colnames(toy_count_matrix) <- sample_names
rownames(toy_count_matrix) <- paste(rep("gene_id"),1:nrow(toy_count_matrix),sep="_")

## create a toy annotation
toy_col_data <- data.frame(sample_name=sample_names,category=c(rep("A",length(sample_names)/2),rep("B",length(sample_names)/2)))

## create a toy dds
dds <- DESeqDataSetFromMatrix(toy_count_matrix, toy_col_data, ~category)

## test execution time
system.time(DESeq(dds, parallel=TRUE, BPPARAM=MulticoreParam(4)))

Here's the output from the last command with R/3.5.1:

estimating size factors
estimating dispersions
gene-wise dispersion estimates: 4 workers
mean-dispersion relationship
-- note: fitType='parametric', but the dispersion trend was not well captured by the
   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.
final dispersion estimates, fitting model and testing: 4 workers
-- replacing outliers and refitting for 1 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
   user  system elapsed
  4.009   1.391   5.408

Here's my session info for R/3.5.1:

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /sfs/applications/201912/software/standard/compiler/gcc/7.1.0/R/3.5.1/lib64/R/lib/libRblas.so
LAPACK: /sfs/applications/201912/software/standard/compiler/gcc/7.1.0/R/3.5.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] DESeq2_1.22.2               SummarizedExperiment_1.12.0
 [3] DelayedArray_0.8.0          BiocParallel_1.16.6
 [5] matrixStats_0.55.0          Biobase_2.42.0
 [7] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2
 [9] IRanges_2.16.0              S4Vectors_0.20.1
[11] BiocGenerics_0.28.0

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.5.1          Formula_1.2-3
 [4] assertthat_0.2.1       latticeExtra_0.6-28    blob_1.2.0
 [7] GenomeInfoDbData_1.2.0 pillar_1.4.2           RSQLite_2.1.2
[10] backports_1.1.5        lattice_0.20-38        glue_1.3.1
[13] digest_0.6.22          RColorBrewer_1.1-2     XVector_0.22.0
[16] checkmate_1.9.4        colorspace_1.4-1       htmltools_0.4.0
[19] Matrix_1.2-17          XML_3.98-1.20          pkgconfig_2.0.3
[22] genefilter_1.64.0      zlibbioc_1.28.0        purrr_0.3.3
[25] xtable_1.8-4           scales_1.0.0           htmlTable_1.13.2
[28] tibble_2.1.3           annotate_1.60.1        ggplot2_3.2.1
[31] nnet_7.3-12            lazyeval_0.2.2         survival_2.44-1.1
[34] magrittr_1.5           crayon_1.3.4           memoise_1.1.0
[37] MASS_7.3-51.4          foreign_0.8-72         tools_3.5.1
[40] data.table_1.12.6      stringr_1.4.0          locfit_1.5-9.1
[43] munsell_0.5.0          cluster_2.1.0          AnnotationDbi_1.44.0
[46] compiler_3.5.1         rlang_0.4.1            grid_3.5.1
[49] RCurl_1.95-4.12        rstudioapi_0.10        htmlwidgets_1.5.1
[52] bitops_1.0-6           base64enc_0.1-3        gtable_0.3.0
[55] DBI_1.0.0              R6_2.4.0               gridExtra_2.3
[58] knitr_1.25             dplyr_0.8.3            bit_1.1-14
[61] zeallot_0.1.0          Hmisc_4.2-0            stringi_1.4.3
[64] Rcpp_1.0.2             vctrs_0.2.0            geneplotter_1.60.0
[67] rpart_4.1-15           acepack_1.4.1          tidyselect_0.2.5
[70] xfun_0.10

Here's the output from the last command with R/3.6.0:

estimating size factors
estimating dispersions
gene-wise dispersion estimates: 4 workers
Error in mcexit(0L) : ignoring SIGPIPE signal
Error in mcexit(0L) : ignoring SIGPIPE signal
Error in mcexit(0L) : ignoring SIGPIPE signal
mean-dispersion relationship
final dispersion estimates, fitting model and testing: 4 workers
Error in mcexit(0L) : ignoring SIGPIPE signal
Error in mcexit(0L) : ignoring SIGPIPE signal
Error in mcexit(0L) : ignoring SIGPIPE signal
Error in mcexit(0L) : ignoring SIGPIPE signal
   user  system elapsed
  5.612   1.453   7.072

Here's my session info for R/3.6.0:

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /sfs/applications/201912/software/standard/compiler/gcc/7.1.0/openblas/0.2.19/lib/libopenblas_sandybridgep-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] DESeq2_1.26.0               SummarizedExperiment_1.16.1
 [3] DelayedArray_0.12.2         BiocParallel_1.20.1
 [5] matrixStats_0.55.0          Biobase_2.46.0
 [7] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0
 [9] IRanges_2.20.2              S4Vectors_0.24.3
[11] BiocGenerics_0.32.0

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.6.0          Formula_1.2-3
 [4] assertthat_0.2.1       latticeExtra_0.6-29    blob_1.2.1
 [7] GenomeInfoDbData_1.2.2 pillar_1.4.3           RSQLite_2.2.0
[10] backports_1.1.5        lattice_0.20-38        glue_1.3.1
[13] digest_0.6.23          RColorBrewer_1.1-2     XVector_0.26.0
[16] checkmate_1.9.4        colorspace_1.4-1       htmltools_0.4.0
[19] Matrix_1.2-18          XML_3.99-0.3           pkgconfig_2.0.3
[22] genefilter_1.68.0      zlibbioc_1.32.0        purrr_0.3.3
[25] xtable_1.8-4           scales_1.1.0           jpeg_0.1-8.1
[28] htmlTable_1.13.3       tibble_2.1.3           annotate_1.64.0
[31] ggplot2_3.2.1          nnet_7.3-12            lazyeval_0.2.2
[34] survival_3.1-8         magrittr_1.5           crayon_1.3.4
[37] memoise_1.1.0          MASS_7.3-51.5          foreign_0.8-75
[40] tools_3.6.0            data.table_1.12.8      lifecycle_0.1.0
[43] stringr_1.4.0          locfit_1.5-9.1         munsell_0.5.0
[46] cluster_2.1.0          AnnotationDbi_1.48.0   compiler_3.6.0
[49] rlang_0.4.2            grid_3.6.0             RCurl_1.98-1.1
[52] rstudioapi_0.10        htmlwidgets_1.5.1      bitops_1.0-6
[55] base64enc_0.1-3        gtable_0.3.0           DBI_1.1.0
[58] R6_2.4.1               gridExtra_2.3          knitr_1.27
[61] dplyr_0.8.3            zeallot_0.1.0          bit_1.1-15.1
[64] Hmisc_4.3-0            stringi_1.4.5          Rcpp_1.0.3
[67] geneplotter_1.64.0     vctrs_0.2.1            rpart_4.1-15
[70] acepack_1.4.1          png_0.1-7              tidyselect_0.2.5
[73] xfun_0.12

Here's the output from the last command with R/3.6.1:

estimating size factors
estimating dispersions
gene-wise dispersion estimates: 4 workers
mean-dispersion relationship
-- note: fitType='parametric', but the dispersion trend was not well captured by the
   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.
final dispersion estimates, fitting model and testing: 4 workers
-- replacing outliers and refitting for 1 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
   user  system elapsed
  7.400   1.502   8.910

Here's my session info for R/3.6.1:

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /sfs/applications/201912/software/standard/compiler/gcc/7.1.0/openblas/0.2.19/lib/libopenblas_sandybridgep-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] DESeq2_1.26.0               SummarizedExperiment_1.16.1
 [3] DelayedArray_0.12.2         BiocParallel_1.20.1
 [5] matrixStats_0.55.0          Biobase_2.46.0
 [7] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0
 [9] IRanges_2.20.2              S4Vectors_0.24.3
[11] BiocGenerics_0.32.0

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.6.1          Formula_1.2-3
 [4] assertthat_0.2.1       latticeExtra_0.6-29    blob_1.2.1
 [7] GenomeInfoDbData_1.2.2 pillar_1.4.3           RSQLite_2.2.0
[10] backports_1.1.5        lattice_0.20-38        glue_1.3.1
[13] digest_0.6.23          RColorBrewer_1.1-2     XVector_0.26.0
[16] checkmate_1.9.4        colorspace_1.4-1       htmltools_0.4.0
[19] Matrix_1.2-18          XML_3.99-0.3           pkgconfig_2.0.3
[22] genefilter_1.68.0      zlibbioc_1.32.0        purrr_0.3.3
[25] xtable_1.8-4           scales_1.1.0           jpeg_0.1-8.1
[28] htmlTable_1.13.3       tibble_2.1.3           annotate_1.64.0
[31] ggplot2_3.2.1          nnet_7.3-12            lazyeval_0.2.2
[34] survival_3.1-8         magrittr_1.5           crayon_1.3.4
[37] memoise_1.1.0          MASS_7.3-51.5          foreign_0.8-75
[40] tools_3.6.1            data.table_1.12.8      lifecycle_0.1.0
[43] stringr_1.4.0          locfit_1.5-9.1         munsell_0.5.0
[46] cluster_2.1.0          AnnotationDbi_1.48.0   compiler_3.6.1
[49] rlang_0.4.2            grid_3.6.1             RCurl_1.98-1.1
[52] rstudioapi_0.10        htmlwidgets_1.5.1      bitops_1.0-6
[55] base64enc_0.1-3        gtable_0.3.0           DBI_1.1.0
[58] R6_2.4.1               gridExtra_2.3          knitr_1.27
[61] dplyr_0.8.3            zeallot_0.1.0          bit_1.1-15.1
[64] Hmisc_4.3-0            stringi_1.4.5          Rcpp_1.0.3
[67] geneplotter_1.64.0     vctrs_0.2.1            rpart_4.1-15
[70] acepack_1.4.1          png_0.1-7              tidyselect_0.2.5
[73] xfun_0.12
DESeq2 • 742 views
ADD COMMENT
2
Entering edit mode
@mikelove
Last seen 3 hours ago
United States

I would guess it's not DESeq2 but the parallel backend perhaps.

Can you try just the 1,000 genes without any parallelization for your target ~270 samples?

This takes 6 seconds on my machine per 1,000 genes:

> dds <- makeExampleDESeqDataSet(n=1000,m=270)
> system.time({dds <- DESeq(dds)})
using pre-existing size factors
estimating dispersions
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
   user  system elapsed 
  5.763   0.019   5.784

In fact we have sped up DESeq2 quite substantially in version 1.26.0:

http://bioconductor.org/packages/release/bioc/news/DESeq2/NEWS

ADD COMMENT
1
Entering edit mode

On my machine, 2,000 genes with 2 workers takes 7.7 seconds, and 3,000 genes with 3 workers takes 8.6 seconds. This is typical that you wouldn't see linear decrease with number of workers, but that there is a speedup with parallelization nonetheless. You may want to be careful with setting up to 20 cores if you are working on a cluster where the cores may not be on the same node, this can substantially hurt performance, where you'd be better off without any parallelization at all. I've seen this happen multiple times from users on the support forum in fact.

ADD REPLY
0
Entering edit mode

Based on SLURM output, my parallel jobs are running on a single node. Nevertheless, additional trouble-shooting today with parallelization proved that v1.26.0 is indeed substantially faster than v1.22.2 when processing is restricted to a single core. v1.26.0 processed a toy dataset in a fraction of the time it took v1.22.2 to process the same dataset. In fact, v1.26.0 processed the dataset on a single core as fast as v1.22.2 processed the data on 10 cores. The problem I'm having is clearly with the parallelization, as v1.26.0 timed out when run on multiple cores, so I wish I could rename and retag my post to attract feedback on BiocParallel instead of DESeq2. Thanks again for your comments on parallelization.

ADD REPLY
1
Entering edit mode

Simplify the problem to make sure basics are ok, e..g, bplapply(1:4, sqrt, BPPARAM=MulticoreParam()). Remember that the cost of distributing data / retrieving results can be significant, and can outweigh gains if the computational 'work' done on core is small.

ADD REPLY
0
Entering edit mode

Thank you for your prompt reply. I'm sorry for questioning the DESeq2 development team. I agree it's most likely the parallel backend. Running your sample code on my frontend works fine. The HPC team at my institution didn't have any promising leads, but I'm going to experiment more with different parallel arrangements on the backend. I'm satisfied this is most likely NOT a DESeq2 performance issue. Thank you for providing such a great package.

ADD REPLY

Login before adding your answer.

Traffic: 685 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6