How to better run DEXSeq with a very large number of samples?
Entering edit mode
Angel ▴ 10
Last seen 3.1 years ago


I am trying to run DEXSeq splicing analysis on a very large dataset using the "AlternativeCountData" option. I have attached my DexSeqDataSet object - the data is huge, with 414939 exons and up to 200 conditions. Using the BPPARAM option, I am running the estimateDispersions step with 54 CPU cores, but it's really really slow - the progress bar has only moved to 2% in 24 hours when only doing 59 conditions (design = ~ sample + exon + condition:exon)

Is my dataset too big to run DEXSeq this way?

I have already filtered the dataset for exons with at least 5 reads in all replicates of any one sample. Ideally, I want to process the dataset in its entirety, which means considering only a subset of exons is sub-optimal. Is there a possible way to get through this?

I've looked at a few other forums which have addressed this issue but they're a few years old, so unfortunately I have little to go from.

I would really appreciate any help, because I don't have alot of tricks up my sleeve when handling DEXSeq and would like to learn more.

many thanks

Problematic code:

> d
class: DEXSeqDataSet 
dim: 414939 914 
metadata(1): version
assays(1): counts
rownames(414939): X_153693905_153694633:X_153693905_153694633 X_153693905_153694633:X_153693905_153694633.1 ... 17_42956557_42957137:17_42956557_42956560
rowData names(4): featureID groupID exonBaseMean exonBaseVar
colnames: NULL
colData names(8): sample condition ... exon sizeFactor

d = DEXSeq::estimateSizeFactors(d)

d = DEXSeq::estimateDispersions(d, BPPARAM = MulticoreParam(workers = 54, progressbar = TRUE))

R session info:

> sessionInfo( )
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/

 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8     LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gdtools_0.2.2               PFAM.db_3.11.4              AnnotationDbi_1.50.3        systemPipeR_1.22.0          ShortRead_1.46.0            GenomicAlignments_1.24.0   
 [7] SummarizedExperiment_1.18.2 DelayedArray_0.14.1         matrixStats_0.57.0          Biobase_2.48.0              BiocParallel_1.22.0         Rsamtools_2.4.0            
[13] Biostrings_2.56.0           XVector_0.28.0              biomaRt_2.44.1              Rtsne_0.15                  umap_0.2.7.0                Seurat_3.2.2               
[19] ComplexHeatmap_2.4.3        kohonen_3.0.10              qualV_0.3-3                 KernSmooth_2.23-18          rtracklayer_1.48.0          GenomicRanges_1.40.0       
[25] GenomeInfoDb_1.24.2         IRanges_2.22.2              S4Vectors_0.26.1            BiocGenerics_0.36.0         furrr_0.1.0                 future_1.18.0              
[31] genefilter_1.70.0           data.table_1.14.0           RColorBrewer_1.1-2          gtools_3.8.2                forcats_0.5.0               stringr_1.4.0              
[37] dplyr_1.0.2                 purrr_0.3.4                 readr_1.3.1                 tidyr_1.1.2                 tibble_3.0.3                ggplot2_3.3.2              
[43] tidyverse_1.3.0            

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.1           AnnotationForge_1.30.1   bit64_4.0.5              knitr_1.30               irlba_2.3.3              rpart_4.1-15             hwriter_1.3.2           
  [8] RCurl_1.98-1.2           generics_0.1.0           GenomicFeatures_1.40.1   cowplot_1.1.0            RSQLite_2.2.1            RANN_2.6.1               bit_4.0.4               
 [15] base64url_1.4            spatstat.data_1.4-3      xml2_1.3.2               lubridate_1.7.9          httpuv_1.5.4             assertthat_0.2.1         batchtools_0.9.13       
 [22] xfun_0.17                hms_0.5.3                evaluate_0.14            promises_1.1.1           fansi_0.4.2              progress_1.2.2           dbplyr_1.4.4            
 [29] readxl_1.3.1             Rgraphviz_2.32.0         igraph_1.2.6             DBI_1.1.1                geneplotter_1.66.0       htmlwidgets_1.5.1        ellipsis_0.3.1          
 [36] RSpectra_0.16-0          backports_1.2.1          V8_3.2.0                 annotate_1.66.0          deldir_0.2-10            vctrs_0.3.4              ROCR_1.0-11             
 [43] abind_1.4-5              withr_2.2.0              DOT_0.1                  BSgenome_1.56.0          checkmate_2.0.0          sctransform_0.3.1        prettyunits_1.1.1       
 [50] goftest_1.2-2            svglite_1.2.3.2          cluster_2.1.0            lazyeval_0.2.2           crayon_1.4.1             edgeR_3.30.3             pkgconfig_2.0.3         
 [57] labeling_0.3             nlme_3.1-151             rlang_0.4.7              globals_0.14.0           lifecycle_0.2.0          miniUI_0.1.1.1           BiocFileCache_1.12.1    
 [64] GOstats_2.54.0           modelr_0.1.8             rsvd_1.0.3               cellranger_1.1.0         rsvg_2.1                 polyclip_1.10-0          lmtest_0.9-38           
 [71] graph_1.66.0             Matrix_1.3-2             zoo_1.8-8                reprex_0.3.0             ggridges_0.5.2           GlobalOptions_0.1.2      pheatmap_1.0.12         
 [78] png_0.1-7                viridisLite_0.3.0        rjson_0.2.20             bitops_1.0-6             blob_1.2.1               shape_1.4.5              brew_1.0-6              
 [85] jpeg_0.1-8.1             scales_1.1.1             memoise_1.1.0            GSEABase_1.50.1          magrittr_2.0.1           plyr_1.8.6               ica_1.0-2               
 [92] zlibbioc_1.34.0          compiler_4.0.3           tinytex_0.25             clue_0.3-58              DESeq2_1.28.1            fitdistrplus_1.1-3       cli_2.3.1               
 [99] listenv_0.8.0            Category_2.54.0          patchwork_1.0.1          pbapply_1.4-3            MASS_7.3-53              mgcv_1.8-33              tidyselect_1.1.0        
[106] stringi_1.5.3            yaml_2.2.1               askpass_1.1              locfit_1.5-9.4           latticeExtra_0.6-29      ggrepel_0.8.2            VariantAnnotation_1.34.0
[113] tools_4.0.3              future.apply_1.6.0       circlize_0.4.10          rstudioapi_0.11          DEXSeq_1.36.0            gridExtra_2.3            farver_2.1.0            
[120] digest_0.6.27            shiny_1.5.0              Rcpp_1.0.5               broom_0.7.0              later_1.1.0.1            RcppAnnoy_0.0.16         httr_1.4.2              
[127] colorspace_2.0-0         rvest_0.3.6              XML_3.99-0.5             fs_1.5.0                 tensor_1.5               reticulate_1.16          splines_4.0.3           
[134] uwot_0.1.8               RBGL_1.64.0              statmod_1.4.34           spatstat.utils_1.17-0    systemfonts_0.3.1        plotly_4.9.2.1           xtable_1.8-4            
[141] jsonlite_1.7.1           spatstat_1.64-1          R6_2.4.1                 pillar_1.4.6             htmltools_0.5.0          mime_0.9                 cpp11_0.2.6             
[148] glue_1.4.2               fastmap_1.1.0            codetools_0.2-18         utf8_1.1.4               lattice_0.20-41          curl_4.3                 leiden_0.3.3            
[155] GO.db_3.11.4             openssl_1.4.3            survival_3.2-7           limma_3.44.3             rmarkdown_2.5            munsell_0.5.0            GetoptLong_1.0.2        
[162] GenomeInfoDbData_1.2.4   haven_2.3.1              reshape2_1.4.4           gtable_0.3.0
DEXSeq • 1.0k views
Entering edit mode
Alejandro Reyes ★ 1.9k
Last seen 5 hours ago
Novartis Institutes for BioMedical Reseā€¦

Hi Angel,

Thanks for your interest. At the moment, the model used by DEXSeq fits a parameter for each sample. For large datasets like yours, it will be difficult to complete running times unless you throw a lot of CPUs and resources to it (many, many more of what you are already doing). We're trying to fix this but it might take some time to get to a solution. In the meantime, I'd recommend this paper:

They benchmark performance and running times of different methods. Seems that their proposed tool satuRn and DoubleExpSeq have similar performance to DEXSeq and don't have problems running in large number of samples. Worth giving them a try IMO.


Entering edit mode

thank you Alejandro for being so helpful as always! have a great day!


Login before adding your answer.

Traffic: 713 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6