Question

How to better run DEXSeq with a very large number of samples?

0

Entering edit mode

Angel ▴ 10

@angel-24954

Last seen 3.3 years ago

Hi,

I am trying to run DEXSeq splicing analysis on a very large dataset using the "AlternativeCountData" option. I have attached my DexSeqDataSet object - the data is huge, with 414939 exons and up to 200 conditions. Using the BPPARAM option, I am running the estimateDispersions step with 54 CPU cores, but it's really really slow - the progress bar has only moved to 2% in 24 hours when only doing 59 conditions (design = ~ sample + exon + condition:exon)

Is my dataset too big to run DEXSeq this way?

I have already filtered the dataset for exons with at least 5 reads in all replicates of any one sample. Ideally, I want to process the dataset in its entirety, which means considering only a subset of exons is sub-optimal. Is there a possible way to get through this?

I've looked at a few other forums which have addressed this issue but they're a few years old, so unfortunately I have little to go from.

I would really appreciate any help, because I don't have alot of tricks up my sleeve when handling DEXSeq and would like to learn more.

many thanks

Problematic code:


> d
class: DEXSeqDataSet 
dim: 414939 914 
metadata(1): version
assays(1): counts
rownames(414939): X_153693905_153694633:X_153693905_153694633 X_153693905_153694633:X_153693905_153694633.1 ... 17_42956557_42957137:17_42956557_42956560
  20_21326602_21328582:20_21328559_21328582
rowData names(4): featureID groupID exonBaseMean exonBaseVar
colnames: NULL
colData names(8): sample condition ... exon sizeFactor

d = DEXSeq::estimateSizeFactors(d)

d = DEXSeq::estimateDispersions(d, BPPARAM = MulticoreParam(workers = 54, progressbar = TRUE))

R session info:


> sessionInfo( )
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8     LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gdtools_0.2.2               PFAM.db_3.11.4              AnnotationDbi_1.50.3        systemPipeR_1.22.0          ShortRead_1.46.0            GenomicAlignments_1.24.0   
 [7] SummarizedExperiment_1.18.2 DelayedArray_0.14.1         matrixStats_0.57.0          Biobase_2.48.0              BiocParallel_1.22.0         Rsamtools_2.4.0            
[13] Biostrings_2.56.0           XVector_0.28.0              biomaRt_2.44.1              Rtsne_0.15                  umap_0.2.7.0                Seurat_3.2.2               
[19] ComplexHeatmap_2.4.3        kohonen_3.0.10              qualV_0.3-3                 KernSmooth_2.23-18          rtracklayer_1.48.0          GenomicRanges_1.40.0       
[25] GenomeInfoDb_1.24.2         IRanges_2.22.2              S4Vectors_0.26.1            BiocGenerics_0.36.0         furrr_0.1.0                 future_1.18.0              
[31] genefilter_1.70.0           data.table_1.14.0           RColorBrewer_1.1-2          gtools_3.8.2                forcats_0.5.0               stringr_1.4.0              
[37] dplyr_1.0.2                 purrr_0.3.4                 readr_1.3.1                 tidyr_1.1.2                 tibble_3.0.3                ggplot2_3.3.2              
[43] tidyverse_1.3.0            

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.1           AnnotationForge_1.30.1   bit64_4.0.5              knitr_1.30               irlba_2.3.3              rpart_4.1-15             hwriter_1.3.2           
  [8] RCurl_1.98-1.2           generics_0.1.0           GenomicFeatures_1.40.1   cowplot_1.1.0            RSQLite_2.2.1            RANN_2.6.1               bit_4.0.4               
 [15] base64url_1.4            spatstat.data_1.4-3      xml2_1.3.2               lubridate_1.7.9          httpuv_1.5.4             assertthat_0.2.1         batchtools_0.9.13       
 [22] xfun_0.17                hms_0.5.3                evaluate_0.14            promises_1.1.1           fansi_0.4.2              progress_1.2.2           dbplyr_1.4.4            
 [29] readxl_1.3.1             Rgraphviz_2.32.0         igraph_1.2.6             DBI_1.1.1                geneplotter_1.66.0       htmlwidgets_1.5.1        ellipsis_0.3.1          
 [36] RSpectra_0.16-0          backports_1.2.1          V8_3.2.0                 annotate_1.66.0          deldir_0.2-10            vctrs_0.3.4              ROCR_1.0-11             
 [43] abind_1.4-5              withr_2.2.0              DOT_0.1                  BSgenome_1.56.0          checkmate_2.0.0          sctransform_0.3.1        prettyunits_1.1.1       
 [50] goftest_1.2-2            svglite_1.2.3.2          cluster_2.1.0            lazyeval_0.2.2           crayon_1.4.1             edgeR_3.30.3             pkgconfig_2.0.3         
 [57] labeling_0.3             nlme_3.1-151             rlang_0.4.7              globals_0.14.0           lifecycle_0.2.0          miniUI_0.1.1.1           BiocFileCache_1.12.1    
 [64] GOstats_2.54.0           modelr_0.1.8             rsvd_1.0.3               cellranger_1.1.0         rsvg_2.1                 polyclip_1.10-0          lmtest_0.9-38           
 [71] graph_1.66.0             Matrix_1.3-2             zoo_1.8-8                reprex_0.3.0             ggridges_0.5.2           GlobalOptions_0.1.2      pheatmap_1.0.12         
 [78] png_0.1-7                viridisLite_0.3.0        rjson_0.2.20             bitops_1.0-6             blob_1.2.1               shape_1.4.5              brew_1.0-6              
 [85] jpeg_0.1-8.1             scales_1.1.1             memoise_1.1.0            GSEABase_1.50.1          magrittr_2.0.1           plyr_1.8.6               ica_1.0-2               
 [92] zlibbioc_1.34.0          compiler_4.0.3           tinytex_0.25             clue_0.3-58              DESeq2_1.28.1            fitdistrplus_1.1-3       cli_2.3.1               
 [99] listenv_0.8.0            Category_2.54.0          patchwork_1.0.1          pbapply_1.4-3            MASS_7.3-53              mgcv_1.8-33              tidyselect_1.1.0        
[106] stringi_1.5.3            yaml_2.2.1               askpass_1.1              locfit_1.5-9.4           latticeExtra_0.6-29      ggrepel_0.8.2            VariantAnnotation_1.34.0
[113] tools_4.0.3              future.apply_1.6.0       circlize_0.4.10          rstudioapi_0.11          DEXSeq_1.36.0            gridExtra_2.3            farver_2.1.0            
[120] digest_0.6.27            shiny_1.5.0              Rcpp_1.0.5               broom_0.7.0              later_1.1.0.1            RcppAnnoy_0.0.16         httr_1.4.2              
[127] colorspace_2.0-0         rvest_0.3.6              XML_3.99-0.5             fs_1.5.0                 tensor_1.5               reticulate_1.16          splines_4.0.3           
[134] uwot_0.1.8               RBGL_1.64.0              statmod_1.4.34           spatstat.utils_1.17-0    systemfonts_0.3.1        plotly_4.9.2.1           xtable_1.8-4            
[141] jsonlite_1.7.1           spatstat_1.64-1          R6_2.4.1                 pillar_1.4.6             htmltools_0.5.0          mime_0.9                 cpp11_0.2.6             
[148] glue_1.4.2               fastmap_1.1.0            codetools_0.2-18         utf8_1.1.4               lattice_0.20-41          curl_4.3                 leiden_0.3.3            
[155] GO.db_3.11.4             openssl_1.4.3            survival_3.2-7           limma_3.44.3             rmarkdown_2.5            munsell_0.5.0            GetoptLong_1.0.2        
[162] GenomeInfoDbData_1.2.4   haven_2.3.1              reshape2_1.4.4           gtable_0.3.0

DEXSeq • 1.2k views

ADD COMMENT • link 3.4 years ago Angel ▴ 10

score 2 · Accepted Answer · 2021-03-08

Hi Angel,

Thanks for your interest. At the moment, the model used by DEXSeq fits a parameter for each sample. For large datasets like yours, it will be difficult to complete running times unless you throw a lot of CPUs and resources to it (many, many more of what you are already doing). We're trying to fix this but it might take some time to get to a solution. In the meantime, I'd recommend this paper: https://www.biorxiv.org/content/10.1101/2021.01.14.426636v1.full

They benchmark performance and running times of different methods. Seems that their proposed tool satuRn and DoubleExpSeq have similar performance to DEXSeq and don't have problems running in large number of samples. Worth giving them a try IMO.

Alejandro