I am running a differential gene expression analysis on the TCGA-BRCA cohort to investigate the effect of tumour cellularity on differential gene expression analysis (how the biology inferred from these analyses may differ if cellularity is included as a covariate). My experimental setup include running the DESeq function with (1) batch effects (BatchId + TSS) and cellularity (ESTIMATE) as a covariate and (2) batch effects (BatchId + TSS) only. The DGE analysis is done between two subtypes of breast cancer (encoded as the HR_status variable)
# Generate DESeq object from raw counts and include batch effects and cellularity as coveriate
DEseq.pur <- DESeqDataSetFromMatrix(countData = rounded.est,
colData = pca.plots.all.purity.est,
design = ~ HR_status + BatchId + TSS + ESTIMATE)
# Object generate without cellularity as a covariate
DEseq.nopur <- DESeqDataSetFromMatrix(countData = rounded.est,
colData = pca.plots.all.purity.est,
design = ~ HR_status + BatchId + TSS)
After filtering the initial dataset to remove low count genes, I have a DESeqDataSet consisting of 33980 genes and 823 samples. The problem is that for experiment 2 (batch effects only), the DESeq function is not running to completion, despite allowing for more than 12 hours and 10 cpus. The function is stuck on estimating gene-wise dispersion.
library(DESeq2)
library(BiocParallel)
load("DEseq.est.Rdata")
DEseq.est.run <- DESeq(DEseq.est, parallel=TRUE, BPPARAM=MulticoreParam(10))
Weird thing is that if run experiment 1 (batch effects and cellularity as a covariate) with the same specs, this time DESeq runs in 5 hours. I can't figure out what is happening to cause this....
Output of sessionInfo:
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /home/uqmdeluc/R-3.6.1/lib/libRblas.so
LAPACK: /home/uqmdeluc/R-3.6.1/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DESeq2_1.26.0 SummarizedExperiment_1.16.1
[3] DelayedArray_0.12.3 BiocParallel_1.20.1
[5] matrixStats_0.58.0 Biobase_2.46.0
[7] GenomicRanges_1.38.0 GenomeInfoDb_1.22.1
[9] IRanges_2.20.2 S4Vectors_0.24.4
[11] BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] bit64_4.0.5 splines_3.6.1 Formula_1.2-4
[4] assertthat_0.2.1 latticeExtra_0.6-29 blob_1.2.1
[7] GenomeInfoDbData_1.2.2 pillar_1.5.0 RSQLite_2.2.3
[10] backports_1.2.1 lattice_0.20-38 glue_1.4.2
[13] digest_0.6.27 RColorBrewer_1.1-2 XVector_0.26.0
[16] checkmate_2.0.0 colorspace_2.0-0 htmltools_0.5.1.1
[19] Matrix_1.2-17 XML_3.99-0.3 pkgconfig_2.0.3
[22] genefilter_1.68.0 zlibbioc_1.32.0 purrr_0.3.4
[25] xtable_1.8-4 scales_1.1.1 jpeg_0.1-8.1
[28] htmlTable_2.1.0 tibble_3.1.0 annotate_1.64.0
[31] generics_0.1.0 ggplot2_3.3.3 ellipsis_0.3.1
[34] cachem_1.0.4 nnet_7.3-12 survival_3.2-7
[37] magrittr_2.0.1 crayon_1.4.1 memoise_2.0.0
[40] fansi_0.4.2 foreign_0.8-71 tools_3.6.1
[43] data.table_1.14.0 lifecycle_1.0.0 stringr_1.4.0
[46] locfit_1.5-9.4 munsell_0.5.0 cluster_2.1.0
[49] AnnotationDbi_1.48.0 compiler_3.6.1 rlang_0.4.10
[52] grid_3.6.1 RCurl_1.98-1.2 rstudioapi_0.13
[55] htmlwidgets_1.5.3 bitops_1.0-6 base64enc_0.1-3
[58] gtable_0.3.0 DBI_1.1.1 R6_2.5.0
[61] gridExtra_2.3 knitr_1.31 dplyr_1.0.4
[64] fastmap_1.1.0 bit_4.0.4 utf8_1.1.4
[67] Hmisc_4.5-0 stringi_1.5.3 Rcpp_1.0.6
[70] geneplotter_1.64.0 vctrs_0.3.6 rpart_4.1-15
[73] png_0.1-7 tidyselect_1.1.0 xfun_0.21

I'll try your suggestions ! Thanks