Hi,
I am trying to run DEXSeq splicing analysis on a very large dataset using the "AlternativeCountData" option. I have attached my DexSeqDataSet object - the data is huge, with 414939 exons and up to 200 conditions. Using the BPPARAM option, I am running the estimateDispersions step with 54 CPU cores, but it's really really slow - the progress bar has only moved to 2% in 24 hours when only doing 59 conditions (design = ~ sample + exon + condition:exon)
Is my dataset too big to run DEXSeq this way?
I have already filtered the dataset for exons with at least 5 reads in all replicates of any one sample. Ideally, I want to process the dataset in its entirety, which means considering only a subset of exons is sub-optimal. Is there a possible way to get through this?
I've looked at a few other forums which have addressed this issue but they're a few years old, so unfortunately I have little to go from.
I would really appreciate any help, because I don't have alot of tricks up my sleeve when handling DEXSeq and would like to learn more.
many thanks
Problematic code:
> d
class: DEXSeqDataSet
dim: 414939 914
metadata(1): version
assays(1): counts
rownames(414939): X_153693905_153694633:X_153693905_153694633 X_153693905_153694633:X_153693905_153694633.1 ... 17_42956557_42957137:17_42956557_42956560
20_21326602_21328582:20_21328559_21328582
rowData names(4): featureID groupID exonBaseMean exonBaseVar
colnames: NULL
colData names(8): sample condition ... exon sizeFactor
d = DEXSeq::estimateSizeFactors(d)
d = DEXSeq::estimateDispersions(d, BPPARAM = MulticoreParam(workers = 54, progressbar = TRUE))
R session info:
> sessionInfo( )
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] gdtools_0.2.2 PFAM.db_3.11.4 AnnotationDbi_1.50.3 systemPipeR_1.22.0 ShortRead_1.46.0 GenomicAlignments_1.24.0
[7] SummarizedExperiment_1.18.2 DelayedArray_0.14.1 matrixStats_0.57.0 Biobase_2.48.0 BiocParallel_1.22.0 Rsamtools_2.4.0
[13] Biostrings_2.56.0 XVector_0.28.0 biomaRt_2.44.1 Rtsne_0.15 umap_0.2.7.0 Seurat_3.2.2
[19] ComplexHeatmap_2.4.3 kohonen_3.0.10 qualV_0.3-3 KernSmooth_2.23-18 rtracklayer_1.48.0 GenomicRanges_1.40.0
[25] GenomeInfoDb_1.24.2 IRanges_2.22.2 S4Vectors_0.26.1 BiocGenerics_0.36.0 furrr_0.1.0 future_1.18.0
[31] genefilter_1.70.0 data.table_1.14.0 RColorBrewer_1.1-2 gtools_3.8.2 forcats_0.5.0 stringr_1.4.0
[37] dplyr_1.0.2 purrr_0.3.4 readr_1.3.1 tidyr_1.1.2 tibble_3.0.3 ggplot2_3.3.2
[43] tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] rappdirs_0.3.1 AnnotationForge_1.30.1 bit64_4.0.5 knitr_1.30 irlba_2.3.3 rpart_4.1-15 hwriter_1.3.2
[8] RCurl_1.98-1.2 generics_0.1.0 GenomicFeatures_1.40.1 cowplot_1.1.0 RSQLite_2.2.1 RANN_2.6.1 bit_4.0.4
[15] base64url_1.4 spatstat.data_1.4-3 xml2_1.3.2 lubridate_1.7.9 httpuv_1.5.4 assertthat_0.2.1 batchtools_0.9.13
[22] xfun_0.17 hms_0.5.3 evaluate_0.14 promises_1.1.1 fansi_0.4.2 progress_1.2.2 dbplyr_1.4.4
[29] readxl_1.3.1 Rgraphviz_2.32.0 igraph_1.2.6 DBI_1.1.1 geneplotter_1.66.0 htmlwidgets_1.5.1 ellipsis_0.3.1
[36] RSpectra_0.16-0 backports_1.2.1 V8_3.2.0 annotate_1.66.0 deldir_0.2-10 vctrs_0.3.4 ROCR_1.0-11
[43] abind_1.4-5 withr_2.2.0 DOT_0.1 BSgenome_1.56.0 checkmate_2.0.0 sctransform_0.3.1 prettyunits_1.1.1
[50] goftest_1.2-2 svglite_1.2.3.2 cluster_2.1.0 lazyeval_0.2.2 crayon_1.4.1 edgeR_3.30.3 pkgconfig_2.0.3
[57] labeling_0.3 nlme_3.1-151 rlang_0.4.7 globals_0.14.0 lifecycle_0.2.0 miniUI_0.1.1.1 BiocFileCache_1.12.1
[64] GOstats_2.54.0 modelr_0.1.8 rsvd_1.0.3 cellranger_1.1.0 rsvg_2.1 polyclip_1.10-0 lmtest_0.9-38
[71] graph_1.66.0 Matrix_1.3-2 zoo_1.8-8 reprex_0.3.0 ggridges_0.5.2 GlobalOptions_0.1.2 pheatmap_1.0.12
[78] png_0.1-7 viridisLite_0.3.0 rjson_0.2.20 bitops_1.0-6 blob_1.2.1 shape_1.4.5 brew_1.0-6
[85] jpeg_0.1-8.1 scales_1.1.1 memoise_1.1.0 GSEABase_1.50.1 magrittr_2.0.1 plyr_1.8.6 ica_1.0-2
[92] zlibbioc_1.34.0 compiler_4.0.3 tinytex_0.25 clue_0.3-58 DESeq2_1.28.1 fitdistrplus_1.1-3 cli_2.3.1
[99] listenv_0.8.0 Category_2.54.0 patchwork_1.0.1 pbapply_1.4-3 MASS_7.3-53 mgcv_1.8-33 tidyselect_1.1.0
[106] stringi_1.5.3 yaml_2.2.1 askpass_1.1 locfit_1.5-9.4 latticeExtra_0.6-29 ggrepel_0.8.2 VariantAnnotation_1.34.0
[113] tools_4.0.3 future.apply_1.6.0 circlize_0.4.10 rstudioapi_0.11 DEXSeq_1.36.0 gridExtra_2.3 farver_2.1.0
[120] digest_0.6.27 shiny_1.5.0 Rcpp_1.0.5 broom_0.7.0 later_1.1.0.1 RcppAnnoy_0.0.16 httr_1.4.2
[127] colorspace_2.0-0 rvest_0.3.6 XML_3.99-0.5 fs_1.5.0 tensor_1.5 reticulate_1.16 splines_4.0.3
[134] uwot_0.1.8 RBGL_1.64.0 statmod_1.4.34 spatstat.utils_1.17-0 systemfonts_0.3.1 plotly_4.9.2.1 xtable_1.8-4
[141] jsonlite_1.7.1 spatstat_1.64-1 R6_2.4.1 pillar_1.4.6 htmltools_0.5.0 mime_0.9 cpp11_0.2.6
[148] glue_1.4.2 fastmap_1.1.0 codetools_0.2-18 utf8_1.1.4 lattice_0.20-41 curl_4.3 leiden_0.3.3
[155] GO.db_3.11.4 openssl_1.4.3 survival_3.2-7 limma_3.44.3 rmarkdown_2.5 munsell_0.5.0 GetoptLong_1.0.2
[162] GenomeInfoDbData_1.2.4 haven_2.3.1 reshape2_1.4.4 gtable_0.3.0
thank you Alejandro for being so helpful as always! have a great day!