Question

DESeq2: estimating dispersions takes forever to finish

Nicolas Rosewick ▴ 10

@nicolas-rosewick-10121

Belgium/Brussels/ULB

Hello,

I have a count table of 60668 genes x 217 samples.

Using only ~ group, it takes ~10 minutes to finish. But using ~ group + patient, it takes forever: it has now been running for more than 12 hours and is stuck at "gene-wise dispersion estimates: 5 workers".

In the colData, the patient column is distributed as follows (number of samples per patient on top, number of patients below):

metadata$patient %>% table() %>% table()
  1   2   3   4   5   6   8  15 
120  22   5   1   1   1   1   1

Thus the number of samples per patient differs from patient to patient: 120 patients have only 1 sample, while (at the other extreme) 1 patient has 15 samples.

Is it expected to take this long?
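
To make the table above concrete, here is a minimal base-R sketch that rebuilds a patient factor with the same distribution and counts the columns the patient term contributes to a model matrix. The patient IDs (P1, P2, ...) are hypothetical placeholders, and the group term is left out because its levels are not given in the post. A wider design means more coefficients to fit per gene, which may be part of why ~ group + patient is so much slower than ~ group.

```r
# Distribution quoted in the post:
# samples_per_patient = samples per patient, n_patients = patients with that count
samples_per_patient <- c(1, 2, 3, 4, 5, 6, 8, 15)
n_patients          <- c(120, 22, 5, 1, 1, 1, 1, 1)

# One entry per patient, giving that patient's number of samples
sizes <- rep(samples_per_patient, times = n_patients)

# Rebuild a patient factor using hypothetical IDs (P1, P2, ...)
patient <- factor(rep(paste0("P", seq_along(sizes)), times = sizes))

length(patient)    # 217 samples in total
nlevels(patient)   # 152 distinct patients

# Intercept plus one column per non-reference patient
design <- model.matrix(~ patient)
ncol(design)       # 152
```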

Thanks

Here is my code:

library("DESeq2")
library("BiocParallel")

dds <- DESeqDataSetFromMatrix(countData = counts, colData = metadata, design = ~ group + patient)
dds <- estimateSizeFactors(dds)
# keep genes with at least 10 normalized reads in at least 10 samples
keep <- rowSums(counts(dds, normalized = TRUE) >= 10) >= 10
dds <- dds[keep, ]
# only 37001 genes are kept

# multithreaded DESeq2 with 5 workers
register(MulticoreParam(5))
dds <- DESeq(dds, parallel = TRUE)

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 19.1

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=fr_BE.UTF-8       LC_NUMERIC=C               LC_TIME=fr_BE.UTF-8        LC_COLLATE=fr_BE.UTF-8     LC_MONETARY=fr_BE.UTF-8   
 [6] LC_MESSAGES=fr_BE.UTF-8    LC_PAPER=fr_BE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_BE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggrepel_0.8.1               pheatmap_1.0.12             RColorBrewer_1.1-2          cowplot_1.0.0               forcats_0.4.0              
 [6] stringr_1.4.0               dplyr_0.8.4                 purrr_0.3.3                 readr_1.3.1                 tidyr_1.0.2                
[11] tibble_2.1.3                ggplot2_3.2.1               tidyverse_1.3.0             DESeq2_1.26.0               SummarizedExperiment_1.16.1
[16] DelayedArray_0.12.2         BiocParallel_1.20.1         matrixStats_0.55.0          Biobase_2.46.0              GenomicRanges_1.38.0       
[21] GenomeInfoDb_1.22.0         IRanges_2.20.2              S4Vectors_0.24.3            BiocGenerics_0.32.0        

loaded via a namespace (and not attached):
 [1] nlme_3.1-144           fs_1.3.1               bitops_1.0-6           lubridate_1.7.4        bit64_0.9-7            httr_1.4.1            
 [7] tools_3.6.2            backports_1.1.5        utf8_1.1.4             R6_2.4.1               rpart_4.1-15           Hmisc_4.3-1           
[13] DBI_1.1.0              lazyeval_0.2.2         colorspace_1.4-1       nnet_7.3-12            withr_2.1.2            tidyselect_1.0.0      
[19] gridExtra_2.3          bit_1.1-15.2           compiler_3.6.2         cli_2.0.1              rvest_0.3.5            htmlTable_1.13.3      
[25] xml2_1.2.2             labeling_0.3           scales_1.1.0           checkmate_2.0.0        genefilter_1.68.0      digest_0.6.23         
[31] foreign_0.8-75         XVector_0.26.0         base64enc_0.1-3        jpeg_0.1-8.1           pkgconfig_2.0.3        htmltools_0.4.0       
[37] dbplyr_1.4.2           readxl_1.3.1           htmlwidgets_1.5.1      rlang_0.4.4            rstudioapi_0.11        RSQLite_2.2.0         
[43] farver_2.0.3           generics_0.0.2         jsonlite_1.6.1         acepack_1.4.1          RCurl_1.98-1.1         magrittr_1.5          
[49] GenomeInfoDbData_1.2.2 Formula_1.2-3          Matrix_1.2-18          fansi_0.4.1            Rcpp_1.0.3             munsell_0.5.0         
[55] lifecycle_0.1.0        stringi_1.4.5          yaml_2.2.1             zlibbioc_1.32.0        grid_3.6.2             blob_1.2.1            
[61] crayon_1.3.4           lattice_0.20-38        haven_2.2.0            splines_3.6.2          annotate_1.64.0        hms_0.5.3             
[67] locfit_1.5-9.1         knitr_1.28             pillar_1.4.3           geneplotter_1.64.0     reprex_0.3.0           XML_3.99-0.3          
[73] glue_1.3.1             latticeExtra_0.6-29    BiocManager_1.30.10    data.table_1.12.8      modelr_0.1.5           png_0.1-7             
[79] vctrs_0.2.2            cellranger_1.1.0       gtable_0.3.0           assertthat_0.2.1       xfun_0.12              xtable_1.8-4          
[85] broom_0.5.4            survival_3.1-8         AnnotationDbi_1.48.0   memoise_1.1.0          cluster_2.1.0          ellipsis_0.3.0

score 0 · Answer 1 · 2020-02-11

Michael Love 42k

@mikelove

United States

Try without parallel, and instead do some simple prefiltering to remove genes with only a few counts across all samples. Usually users run into issues because the parallel backend makes things slower than necessary.
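
As a rough illustration of this suggestion, the sketch below prefilters a count matrix on total counts per gene and then runs DESeq() serially. The simulated counts, the two-level group factor, and the cutoff of 10 total reads are all assumptions made for the example, not values from this thread; the DESeq2 step is skipped if the package is not installed.

```r
# Simulated stand-in for the real count table (hypothetical, much smaller)
set.seed(1)
n_genes   <- 2000
n_samples <- 20
mu <- rep(c(0.2, 50), each = n_genes / 2)  # first half of the genes barely expressed
counts <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 2),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("sample", seq_len(n_samples))))
metadata <- data.frame(group = factor(rep(c("A", "B"), each = n_samples / 2)),
                       row.names = colnames(counts))

# Simple prefilter: drop genes with fewer than 10 reads summed over all samples
# (the cutoff of 10 is an assumption for this example)
keep <- rowSums(counts) >= 10
filtered <- counts[keep, , drop = FALSE]
cat(sum(keep), "of", n_genes, "genes kept\n")

# Serial run: no register() call, and parallel is left at its default (FALSE)
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = filtered,
                                        colData = metadata,
                                        design = ~ group)
  dds <- DESeq2::DESeq(dds)
}
```

On the real data, the same two steps would apply to the existing dds object before calling DESeq(dds) without the parallel argument.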

Also make sure you update to the latest version. You didn’t note your session info or version.
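
One quick way to report the installed version is sketched below. The helper name installed_version is made up for this example; it relies only on base R's requireNamespace() and utils::packageVersion().

```r
# Return the installed version of a package as a string, or NA if it is missing
# (installed_version is a hypothetical helper, not part of any package)
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

installed_version("DESeq2")
```

Pasting the output of sessionInfo() alongside this gives the R and Bioconductor package versions in one go.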

Reply by Michael Love

Thanks @Michael. I have version 1.26, so the new speed optimization should be there. I will try without parallel.

Reply by Nicolas Rosewick