DESeq function stuck on estimating gene-wise dispersion.
1
0
Entering edit mode
m.deluca • 0
@3100e08f
Last seen 4 weeks ago
Australia

I am running a differential gene expression analysis on the TCGA-BRCA cohort to investigate the effect of tumour cellularity on differential gene expression analysis (how the biology inferred from these analyses may differ if cellularity is included as a covariate). My experimental setup include running the DESeq function with (1) batch effects (BatchId + TSS) and cellularity (ESTIMATE) as a covariate and (2) batch effects (BatchId + TSS) only. The DGE analysis is done between two subtypes of breast cancer (encoded as the HR_status variable)

# Generate DESeq object from raw counts and include batch effects and  cellularity as coveriate
DEseq.pur <- DESeqDataSetFromMatrix(countData = rounded.est,
                                    colData = pca.plots.all.purity.est,
                                    design = ~ HR_status + BatchId + TSS + ESTIMATE)

# Object generate without cellularity as a covariate
DEseq.nopur <- DESeqDataSetFromMatrix(countData = rounded.est,
                                    colData = pca.plots.all.purity.est,
                                    design = ~ HR_status + BatchId + TSS)

After filtering the initial dataset to remove low count genes, I have a DESeqDataSet consisting of 33980 genes and 823 samples. The problem is that for experiment 2 (batch effects only), the DESeq function is not running to completion, despite allowing for more than 12 hours and 10 cpus. The function is stuck on estimating gene-wise dispersion.

library(DESeq2)
library(BiocParallel)
load("DEseq.est.Rdata")
DEseq.est.run <- DESeq(DEseq.est, parallel=TRUE, BPPARAM=MulticoreParam(10))

Weird thing is that if run experiment 1 (batch effects and cellularity as a covariate) with the same specs, this time DESeq runs in 5 hours. I can't figure out what is happening to cause this....

Output of sessionInfo:

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /home/uqmdeluc/R-3.6.1/lib/libRblas.so
LAPACK: /home/uqmdeluc/R-3.6.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] DESeq2_1.26.0               SummarizedExperiment_1.16.1
 [3] DelayedArray_0.12.3         BiocParallel_1.20.1
 [5] matrixStats_0.58.0          Biobase_2.46.0
 [7] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1
 [9] IRanges_2.20.2              S4Vectors_0.24.4
[11] BiocGenerics_0.32.0

loaded via a namespace (and not attached):
 [1] bit64_4.0.5            splines_3.6.1          Formula_1.2-4
 [4] assertthat_0.2.1       latticeExtra_0.6-29    blob_1.2.1
 [7] GenomeInfoDbData_1.2.2 pillar_1.5.0           RSQLite_2.2.3
[10] backports_1.2.1        lattice_0.20-38        glue_1.4.2
[13] digest_0.6.27          RColorBrewer_1.1-2     XVector_0.26.0
[16] checkmate_2.0.0        colorspace_2.0-0       htmltools_0.5.1.1
[19] Matrix_1.2-17          XML_3.99-0.3           pkgconfig_2.0.3
[22] genefilter_1.68.0      zlibbioc_1.32.0        purrr_0.3.4
[25] xtable_1.8-4           scales_1.1.1           jpeg_0.1-8.1
[28] htmlTable_2.1.0        tibble_3.1.0           annotate_1.64.0
[31] generics_0.1.0         ggplot2_3.3.3          ellipsis_0.3.1
[34] cachem_1.0.4           nnet_7.3-12            survival_3.2-7
[37] magrittr_2.0.1         crayon_1.4.1           memoise_2.0.0
[40] fansi_0.4.2            foreign_0.8-71         tools_3.6.1
[43] data.table_1.14.0      lifecycle_1.0.0        stringr_1.4.0
[46] locfit_1.5-9.4         munsell_0.5.0          cluster_2.1.0
[49] AnnotationDbi_1.48.0   compiler_3.6.1         rlang_0.4.10
[52] grid_3.6.1             RCurl_1.98-1.2         rstudioapi_0.13
[55] htmlwidgets_1.5.3      bitops_1.0-6           base64enc_0.1-3
[58] gtable_0.3.0           DBI_1.1.1              R6_2.5.0
[61] gridExtra_2.3          knitr_1.31             dplyr_1.0.4
[64] fastmap_1.1.0          bit_4.0.4              utf8_1.1.4
[67] Hmisc_4.5-0            stringi_1.5.3          Rcpp_1.0.6
[70] geneplotter_1.64.0     vctrs_0.3.6            rpart_4.1-15
[73] png_0.1-7              tidyselect_1.1.0       xfun_0.21
DESeq2 • 147 views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 3 days ago
United States

Sometimes using the parallelization hurts more than helps because you are sending a big dataset around on the cluster, and that becomes slower than the estimation itself. You can find some other posts on the support site where I give recommendations, but for datasets with hundreds of samples, we often use limma-voom in my lab.

ADD COMMENT
0
Entering edit mode

I'll try your suggestions ! Thanks

ADD REPLY

Login before adding your answer.

Traffic: 304 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6