Problem

Question

Excessive memory requirement of variancePartition::dream()

0

Entering edit mode

Junghoon ▴ 10

@649f7bcf

Last seen 4.3 years ago

South Korea

Problem

I posted this question. After a few tries and errors, I found that dream() requires unrealistically large memory. For example, to analyze my 34,155 (genes) x 27,584 (samples) count matrix using the dream pipeline on 4 forked threads, even 500GB RAM is not enough even though the count matrix is just 3.4GB. Based on my understanding of what dream() does (I read this paper), the modeling procedure for each gene is completely independent of the other genes once the precision weights have been calculated by voomWithDreamWeights(). Hence I think there should be no reason for this excessive memory requirement of dream().

Below is the 'dream' part of my script.

dgelist = DGEList(counts = count_data,
                  samples = sample_metadata,
                  group = sample_metadata$sex,
                  genes = gene_metadata)

k = filterByExpr(dgelist)

dgelist = dgelist[k, , keep.lib.sizes = F]

dgelist = calcNormFactors(dgelist, method = "TMM")

form = ~ sex + tissue_cancer + batch + (1|case_id)

v = voomWithDreamWeights(dgelist, form, sample_metadata, BPPARAM = MulticoreParam(12))

dfit = dream(v, form, sample_metadata, computeResiduals = F, BPPARAM = MulticoreParam(4)) # this takes >500GB memory

Questions

So here are my questions.

Is it correct that what dream() does for each gene is completely independent of other genes?
If I manually split the v object into 100 gene subsets (for example 300 genes per each split) and run dream() for each subset separately, will the results be the same as if I ran it as a whole?
Is there any way to reduce the memory requirement of the above script?

Any advice would be very much appreciated!

Session Info

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RColorBrewer_1.1-2          variancePartition_1.20.0    BiocParallel_1.24.1         forcats_0.5.1               stringr_1.4.0              
 [6] dplyr_1.0.5                 purrr_0.3.4                 readr_1.4.0                 tidyr_1.1.3                 tibble_3.1.1               
[11] tidyverse_1.3.1             scales_1.1.1                rstudioapi_0.13             magrittr_2.0.1              ggpubr_0.4.0               
[16] ggplot2_3.3.3               fst_0.9.4                   edgeR_3.32.1                limma_3.46.0                e1071_1.7-6                
[21] dtplyr_1.1.0                DESeq2_1.30.1               SummarizedExperiment_1.20.0 Biobase_2.50.0              MatrixGenerics_1.2.1       
[26] matrixStats_0.58.0          GenomicRanges_1.42.0        GenomeInfoDb_1.26.7         IRanges_2.24.1              S4Vectors_0.28.1           
[31] BiocGenerics_0.36.1         data.table_1.14.0           cowplot_1.1.1              

loaded via a namespace (and not attached):
 [1] minqa_1.2.4            colorspace_2.0-0       ggsignif_0.6.1         ellipsis_0.3.1         class_7.3-18           rio_0.5.26            
 [7] colorRamps_2.3         XVector_0.30.0         fs_1.5.0               proxy_0.4-25           bit64_4.0.5            AnnotationDbi_1.52.0  
[13] fansi_0.4.2            lubridate_1.7.10       xml2_1.3.2             codetools_0.2-18       splines_4.0.4          doParallel_1.0.16     
[19] cachem_1.0.4           geneplotter_1.68.0     knitr_1.33             jsonlite_1.7.2         nloptr_1.2.2.2         pbkrtest_0.5.1        
[25] broom_0.7.6            annotate_1.68.0        dbplyr_2.1.1           compiler_4.0.4         httr_1.4.2             backports_1.2.1       
[31] assertthat_0.2.1       Matrix_1.3-2           fastmap_1.1.0          cli_2.4.0              prettyunits_1.1.1      tools_4.0.4           
[37] gtable_0.3.0           glue_1.4.2             GenomeInfoDbData_1.2.4 reshape2_1.4.4         Rcpp_1.0.6             carData_3.0-4         
[43] cellranger_1.1.0       vctrs_0.3.7            nlme_3.1-152           iterators_1.0.13       xfun_0.22              lme4_1.1-26           
[49] openxlsx_4.2.3         rvest_1.0.0            lifecycle_1.0.0        gtools_3.8.2           statmod_1.4.35         rstatix_0.7.0         
[55] XML_3.99-0.6           MASS_7.3-53.1          zlibbioc_1.36.0        hms_1.0.0              curl_4.3               memoise_2.0.0         
[61] stringi_1.5.3          RSQLite_2.2.7          genefilter_1.72.1      foreach_1.5.1          caTools_1.18.2         boot_1.3-27           
[67] zip_2.1.1              rlang_0.4.10           pkgconfig_2.0.3        bitops_1.0-7           lattice_0.20-41        bit_4.0.4             
[73] tidyselect_1.1.0       plyr_1.8.6             R6_2.5.0               gplots_3.1.1           generics_0.1.0         DelayedArray_0.16.3   
[79] DBI_1.1.1              pillar_1.6.0           haven_2.4.1            foreign_0.8-81         withr_2.4.2            survival_3.2-10       
[85] abind_1.4-5            RCurl_1.98-1.3         modelr_0.1.8           crayon_1.4.1           car_3.0-10             KernSmooth_2.23-18    
[91] utf8_1.2.1             progress_1.2.2         locfit_1.5-9.4         grid_4.0.4             readxl_1.3.1           blob_1.2.1            
[97] reprex_2.0.0           xtable_1.8-4           munsell_0.5.0

BiocParallel limma variancePartition • 1.7k views

ADD COMMENT • link 4.8 years ago • updated 4.6 years ago Junghoon ▴ 10

score 0 · Answer 1 · 2021-04-29

Update: I just did a quick check and confirm the results are the same whether I run dream() on the whole v or on each of manually split chunks of v and then manually merge the results into a complete MArrayLM2 object. It's not elegant, but for now, I think it is the only workaround to avoid excessive memory usage of dream() for a large dataset including thousands of samples and genes such as mine.