Excessive memory requirement of variancePartition::dream()
1
0
Entering edit mode
Junghoon ▴ 10
@649f7bcf
Last seen 14 days ago
South Korea

Problem

I posted this question. After a few trials and errors, I found that dream() requires unrealistically large memory. For example, to analyze my 34,155 (genes) x 27,584 (samples) count matrix using the dream pipeline on 4 forked threads, even 500GB RAM is not enough even though the count matrix is just 3.4GB. Based on my understanding of what dream() does (I read this paper), the modeling procedure for each gene is completely independent of the other genes once the precision weights have been calculated by voomWithDreamWeights(). Hence I think there should be no reason for this excessive memory requirement of dream().

Below is the 'dream' part of my script.

dgelist = DGEList(counts = count_data,
                  samples = sample_metadata,
                  group = sample_metadata$sex,
                  genes = gene_metadata)

k = filterByExpr(dgelist)

dgelist = dgelist[k, , keep.lib.sizes = F]

dgelist = calcNormFactors(dgelist, method = "TMM")

form = ~ sex + tissue_cancer + batch + (1|case_id)

v = voomWithDreamWeights(dgelist, form, sample_metadata, BPPARAM = MulticoreParam(12))

dfit = dream(v, form, sample_metadata, computeResiduals = F, BPPARAM = MulticoreParam(4)) # this takes >500GB memory

Questions

So here are my questions.

  1. Is it correct that what dream() does for each gene is completely independent of other genes?
  2. If I manually split the v object into 100 gene subsets (for example 300 genes per each split) and run dream() for each subset separately, will the results be the same as if I ran it as a whole?
  3. Is there any way to reduce the memory requirement of the above script?

Any advice would be very much appreciated!

Session Info

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RColorBrewer_1.1-2          variancePartition_1.20.0    BiocParallel_1.24.1         forcats_0.5.1               stringr_1.4.0              
 [6] dplyr_1.0.5                 purrr_0.3.4                 readr_1.4.0                 tidyr_1.1.3                 tibble_3.1.1               
[11] tidyverse_1.3.1             scales_1.1.1                rstudioapi_0.13             magrittr_2.0.1              ggpubr_0.4.0               
[16] ggplot2_3.3.3               fst_0.9.4                   edgeR_3.32.1                limma_3.46.0                e1071_1.7-6                
[21] dtplyr_1.1.0                DESeq2_1.30.1               SummarizedExperiment_1.20.0 Biobase_2.50.0              MatrixGenerics_1.2.1       
[26] matrixStats_0.58.0          GenomicRanges_1.42.0        GenomeInfoDb_1.26.7         IRanges_2.24.1              S4Vectors_0.28.1           
[31] BiocGenerics_0.36.1         data.table_1.14.0           cowplot_1.1.1              

loaded via a namespace (and not attached):
 [1] minqa_1.2.4            colorspace_2.0-0       ggsignif_0.6.1         ellipsis_0.3.1         class_7.3-18           rio_0.5.26            
 [7] colorRamps_2.3         XVector_0.30.0         fs_1.5.0               proxy_0.4-25           bit64_4.0.5            AnnotationDbi_1.52.0  
[13] fansi_0.4.2            lubridate_1.7.10       xml2_1.3.2             codetools_0.2-18       splines_4.0.4          doParallel_1.0.16     
[19] cachem_1.0.4           geneplotter_1.68.0     knitr_1.33             jsonlite_1.7.2         nloptr_1.2.2.2         pbkrtest_0.5.1        
[25] broom_0.7.6            annotate_1.68.0        dbplyr_2.1.1           compiler_4.0.4         httr_1.4.2             backports_1.2.1       
[31] assertthat_0.2.1       Matrix_1.3-2           fastmap_1.1.0          cli_2.4.0              prettyunits_1.1.1      tools_4.0.4           
[37] gtable_0.3.0           glue_1.4.2             GenomeInfoDbData_1.2.4 reshape2_1.4.4         Rcpp_1.0.6             carData_3.0-4         
[43] cellranger_1.1.0       vctrs_0.3.7            nlme_3.1-152           iterators_1.0.13       xfun_0.22              lme4_1.1-26           
[49] openxlsx_4.2.3         rvest_1.0.0            lifecycle_1.0.0        gtools_3.8.2           statmod_1.4.35         rstatix_0.7.0         
[55] XML_3.99-0.6           MASS_7.3-53.1          zlibbioc_1.36.0        hms_1.0.0              curl_4.3               memoise_2.0.0         
[61] stringi_1.5.3          RSQLite_2.2.7          genefilter_1.72.1      foreach_1.5.1          caTools_1.18.2         boot_1.3-27           
[67] zip_2.1.1              rlang_0.4.10           pkgconfig_2.0.3        bitops_1.0-7           lattice_0.20-41        bit_4.0.4             
[73] tidyselect_1.1.0       plyr_1.8.6             R6_2.5.0               gplots_3.1.1           generics_0.1.0         DelayedArray_0.16.3   
[79] DBI_1.1.1              pillar_1.6.0           haven_2.4.1            foreign_0.8-81         withr_2.4.2            survival_3.2-10       
[85] abind_1.4-5            RCurl_1.98-1.3         modelr_0.1.8           crayon_1.4.1           car_3.0-10             KernSmooth_2.23-18    
[91] utf8_1.2.1             progress_1.2.2         locfit_1.5-9.4         grid_4.0.4             readxl_1.3.1           blob_1.2.1            
[97] reprex_2.0.0           xtable_1.8-4           munsell_0.5.0
variancePartition limma BiocParallel • 83 views
ADD COMMENT
0
Entering edit mode
Junghoon ▴ 10
@649f7bcf
Last seen 14 days ago
South Korea

Update: I just did a quick check and confirm the results are the same whether I run dream() on the whole v or on each of manually split chunks of v and then manually merge the results into a complete MArrayLM2 object. It's not elegant, but for now, I think it is the only workaround to avoid excessive memory usage of dream() for a large dataset including thousands of samples and genes such as mine.

ADD COMMENT

Login before adding your answer.

Traffic: 470 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6