DESeq2: estimating dispersions takes forever to finish
@nicolas-rosewick-10121
Last seen 4.2 years ago
Belgium/Brussels/ULB

Hello,

I have a count table of 60668 genes x 217 samples.

With the design ~ group, DESeq() takes about 10 minutes to finish. With ~ group + patient it takes forever: it has now been running for more than 12 hours and is stuck at "gene-wise dispersion estimates: 5 workers".

In the colData, the patient column is distributed as follows (top row: number of samples per patient; bottom row: number of patients):

metadata$patient %>% table() %>% table()
  1   2   3   4   5   6   8  15 
120  22   5   1   1   1   1   1

Thus the number of samples per patient varies: 120 patients have only 1 sample, while at the other extreme 1 patient has 15 samples.

Is this expected to take so long?
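To put a number on what the patient term adds to the design, here is a minimal base R sketch that rebuilds a factor with the same shape as the table above and counts the columns of its model matrix. The patient IDs are made up, and group is left out because its levels are not given in this post; each extra group level would add one more column on top of this, assuming the design stays full rank.

```r
# Samples per patient and number of patients, copied from the table above
samples_per_patient <- c(1, 2, 3, 4, 5, 6, 8, 15)
n_patients          <- c(120, 22, 5, 1, 1, 1, 1, 1)

n_levels  <- sum(n_patients)                        # distinct patients
n_samples <- sum(samples_per_patient * n_patients)  # total samples

# Rebuild a patient factor with the same shape (IDs 1..n_levels are hypothetical)
patient <- factor(rep(seq_len(n_levels),
                      times = rep(samples_per_patient, n_patients)))

# Columns contributed by ~ patient alone: intercept + (levels - 1)
design <- model.matrix(~ patient)
c(patients = n_levels, samples = n_samples, columns = ncol(design))
```

So the patient factor has 152 levels across the 217 samples, and ~ patient alone already gives a 152-column design matrix, compared with a handful of columns for ~ group.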

Thanks

Here is my code:

dds <- DESeqDataSetFromMatrix(countData = counts, colData = metadata, design = ~ group + patient)
dds <- estimateSizeFactors(dds)
keep <- rowSums(counts(dds, normalized=TRUE) >= 10) >= 10  # min 10 samples with 10 reads
dds <- dds[keep,]
# only 37001 genes are kept

# multithread DESeq2
library("BiocParallel")
register(MulticoreParam(5))
dds <- DESeq(dds, parallel = TRUE)

# sessionInfo()

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 19.1

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=fr_BE.UTF-8       LC_NUMERIC=C               LC_TIME=fr_BE.UTF-8        LC_COLLATE=fr_BE.UTF-8     LC_MONETARY=fr_BE.UTF-8   
 [6] LC_MESSAGES=fr_BE.UTF-8    LC_PAPER=fr_BE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_BE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggrepel_0.8.1               pheatmap_1.0.12             RColorBrewer_1.1-2          cowplot_1.0.0               forcats_0.4.0              
 [6] stringr_1.4.0               dplyr_0.8.4                 purrr_0.3.3                 readr_1.3.1                 tidyr_1.0.2                
[11] tibble_2.1.3                ggplot2_3.2.1               tidyverse_1.3.0             DESeq2_1.26.0               SummarizedExperiment_1.16.1
[16] DelayedArray_0.12.2         BiocParallel_1.20.1         matrixStats_0.55.0          Biobase_2.46.0              GenomicRanges_1.38.0       
[21] GenomeInfoDb_1.22.0         IRanges_2.20.2              S4Vectors_0.24.3            BiocGenerics_0.32.0        

loaded via a namespace (and not attached):
 [1] nlme_3.1-144           fs_1.3.1               bitops_1.0-6           lubridate_1.7.4        bit64_0.9-7            httr_1.4.1            
 [7] tools_3.6.2            backports_1.1.5        utf8_1.1.4             R6_2.4.1               rpart_4.1-15           Hmisc_4.3-1           
[13] DBI_1.1.0              lazyeval_0.2.2         colorspace_1.4-1       nnet_7.3-12            withr_2.1.2            tidyselect_1.0.0      
[19] gridExtra_2.3          bit_1.1-15.2           compiler_3.6.2         cli_2.0.1              rvest_0.3.5            htmlTable_1.13.3      
[25] xml2_1.2.2             labeling_0.3           scales_1.1.0           checkmate_2.0.0        genefilter_1.68.0      digest_0.6.23         
[31] foreign_0.8-75         XVector_0.26.0         base64enc_0.1-3        jpeg_0.1-8.1           pkgconfig_2.0.3        htmltools_0.4.0       
[37] dbplyr_1.4.2           readxl_1.3.1           htmlwidgets_1.5.1      rlang_0.4.4            rstudioapi_0.11        RSQLite_2.2.0         
[43] farver_2.0.3           generics_0.0.2         jsonlite_1.6.1         acepack_1.4.1          RCurl_1.98-1.1         magrittr_1.5          
[49] GenomeInfoDbData_1.2.2 Formula_1.2-3          Matrix_1.2-18          fansi_0.4.1            Rcpp_1.0.3             munsell_0.5.0         
[55] lifecycle_0.1.0        stringi_1.4.5          yaml_2.2.1             zlibbioc_1.32.0        grid_3.6.2             blob_1.2.1            
[61] crayon_1.3.4           lattice_0.20-38        haven_2.2.0            splines_3.6.2          annotate_1.64.0        hms_0.5.3             
[67] locfit_1.5-9.1         knitr_1.28             pillar_1.4.3           geneplotter_1.64.0     reprex_0.3.0           XML_3.99-0.3          
[73] glue_1.3.1             latticeExtra_0.6-29    BiocManager_1.30.10    data.table_1.12.8      modelr_0.1.5           png_0.1-7             
[79] vctrs_0.2.2            cellranger_1.1.0       gtable_0.3.0           assertthat_0.2.1       xfun_0.12              xtable_1.8-4          
[85] broom_0.5.4            survival_3.1-8         AnnotationDbi_1.48.0   memoise_1.1.0          cluster_2.1.0          ellipsis_0.3.0
@mikelove
Last seen 16 hours ago
United States

Try running without parallel, and instead do some simple prefiltering of genes that have only a few counts across all samples. Users often run into this kind of issue because the parallel backend makes things slower than necessary.
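A minimal sketch of what that could look like, on simulated data. The toy count matrix, the sample names, the two-level group and the "at least 10 reads in at least 3 samples" threshold are all assumptions for illustration (the threshold is scaled down for 12 toy samples); only DESeqDataSetFromMatrix() and DESeq() are actual DESeq2 calls, and they are skipped if DESeq2 is not installed.

```r
# Toy count matrix standing in for the real data (values are simulated)
set.seed(1)
n_genes   <- 200
n_samples <- 12
gene_mean <- 2^seq(1, 10, length.out = n_genes)   # per-gene mean, low to high
counts <- matrix(rnbinom(n_genes * n_samples, mu = gene_mean, size = 2),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", seq_len(n_samples))))
storage.mode(counts) <- "integer"

metadata <- data.frame(group = factor(rep(c("A", "B"), each = n_samples / 2)),
                       row.names = colnames(counts))

# Prefilter on raw counts: keep genes with >= 10 reads in >= 3 samples
keep <- rowSums(counts >= 10) >= 3
counts_filtered <- counts[keep, , drop = FALSE]

if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts_filtered,
                                        colData   = metadata,
                                        design    = ~ group)
  # parallel = FALSE is the default; written out to make the serial run explicit
  dds <- DESeq2::DESeq(dds, parallel = FALSE)
}
```

Filtering the raw matrix before building the object keeps the example free of any size-factor step; filtering on normalized counts after estimateSizeFactors(), as in the question, works the same way.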

Also make sure you update to the latest version. You didn’t note your session info or version.

Thanks @Michael. I have version 1.26, so the new speed optimization should be there. I will try without parallel.
