Question: deseq2 DESeqDataSet size when saved
eric.blanc wrote (13 months ago):
Hi,

I am trying to generate small files to include in my package for regression tests. One of them is a small DESeqDataSet object (object dds_small below, the first 50 features from a complete analysis stored in object dds). However, when I save the small object, its size remains very large:

> dds <- readRDS("2018-02-12_all_tissues/dds.rds")
> object.size(dds)
12579016 bytes
> dds_small <- dds[1:50,]
> object.size(dds_small)
111056 bytes
> length(serialize(dds_small, NULL))
[1] 45625706

Once serialized, the small object is larger than the in-memory size of the original object! The design slot seems to be what uses so much space, as there appears to be an environment attached to it:

> dds_small@design
~(Tissue/Age)/Genotype
<environment: 0x3e64708>
> object.size(dds_small@design)
1344 bytes
> length(serialize(dds_small@design, NULL))
[1] 45353218

This environment probably stores a bunch of packages that were in use when the original object was created, because the sessionInfo (below) reports many loaded packages, although I just did a readRDS command in a fresh R session.

As I am not familiar with environments or with DESeqDataSet internals, my question is: what should I do to keep the size of my subset object small?

Thanks for your help,

Eric

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /home/eblanc/R/R-3.5.1/lib/libRblas.so
LAPACK: /home/eblanc/R/R-3.5.1/lib/libRlapack.so

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Biobase_2.40.0              bit64_0.9-7                
 [3] splines_3.5.1               Formula_1.2-3              
 [5] assertthat_0.2.0            stats4_3.5.1               
 [7] latticeExtra_0.6-28         blob_1.1.1                 
 [9] GenomeInfoDbData_1.1.0      pillar_1.3.0               
[11] RSQLite_2.1.1               backports_1.1.2            
[13] lattice_0.20-35             glue_1.3.0                 
[15] digest_0.6.17               GenomicRanges_1.32.7       
[17] RColorBrewer_1.1-2          XVector_0.20.0             
[19] checkmate_1.8.5             colorspace_1.3-2           
[21] htmltools_0.3.6             Matrix_1.2-14              
[23] plyr_1.8.4                  DESeq2_1.20.0              
[25] XML_3.98-1.16               pkgconfig_2.0.2            
[27] rseqCP_0.1.0                genefilter_1.62.0          
[29] zlibbioc_1.26.0             purrr_0.2.5                
[31] xtable_1.8-3                scales_1.0.0               
[33] BiocParallel_1.14.2         htmlTable_1.12             
[35] tibble_1.4.2                annotate_1.58.0            
[37] IRanges_2.14.12             ggplot2_3.0.0              
[39] SummarizedExperiment_1.10.1 nnet_7.3-12                
[41] BiocGenerics_0.26.0         lazyeval_0.2.1             
[43] survival_2.42-3             magrittr_1.5               
[45] crayon_1.3.4                memoise_1.1.0              
[47] foreign_0.8-70              tools_3.5.1                
[49] data.table_1.11.6           matrixStats_0.54.0         
[51] stringr_1.3.1               S4Vectors_0.18.3           
[53] locfit_1.5-9.1              munsell_0.5.0              
[55] cluster_2.0.7-1             DelayedArray_0.6.6         
[57] AnnotationDbi_1.42.1        bindrcpp_0.2.2             
[59] compiler_3.5.1              GenomeInfoDb_1.16.0        
[61] rlang_0.2.2                 grid_3.5.1                 
[63] RCurl_1.95-4.11             rstudioapi_0.7             
[65] htmlwidgets_1.2             bitops_1.0-6               
[67] base64enc_0.1-3             gtable_0.2.0               
[69] DBI_1.0.0                   R6_2.2.2                   
[71] gridExtra_2.3               knitr_1.20                 
[73] dplyr_0.7.6                 bit_1.1-14                 
[75] bindr_0.1.1                 Hmisc_4.1-1                
[77] stringi_1.2.4               parallel_3.5.1             
[79] Rcpp_0.12.18                geneplotter_1.58.0         
[81] rpart_4.1-13                acepack_1.4.1              
[83] tidyselect_0.2.4           

 

Answer: deseq2 DESeqDataSet size when saved
Michael Love wrote (13 months ago):

There is a thread on here somewhere, but I'll just repeat the options here. The limitation comes from R's formula() function, and part of it is unavoidable. A formula keeps a reference to the environment in which it was created, so if you call formula() inside a function and attach the result to an object, saving that object also saves everything in the function's environment. This happens whether you use DESeqDataSet() or save your own object, e.g. obj <- list(data, formula), built inside a function.
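The effect can be reproduced in base R without DESeq2. The following is a minimal sketch, not the DESeq2 code path: make_obj, big and condition are hypothetical names chosen for illustration, and the script is assumed to run at top level in a fresh session.

```r
# Build a formula inside a function that also holds a large local object.
# make_obj(), big and condition are illustrative names only.
make_obj <- function() {
  big <- rnorm(1e6)   # about 8 MB of local data, unrelated to the formula
  f <- ~ condition    # the formula remembers this function's environment
  list(design = f)
}

obj <- make_obj()

# object.size() does not follow the attached environment, so it looks small.
print(object.size(obj))

# serialize() (which saveRDS() relies on) does follow it, so `big` is written too.
bloated_bytes <- length(serialize(obj, NULL))
print(bloated_bytes)

# The same formula created at top level points at the global environment,
# which is stored as a reference rather than by content.
top_level <- list(design = ~ condition)
small_bytes <- length(serialize(top_level, NULL))
print(small_bytes)

# The local variable is still reachable through the formula.
print(ls(environment(obj$design)))
```

This mirrors the mismatch in the question: object.size() reports a small object while the serialized length is dominated by whatever else lived in the function's environment.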

Let me note for other users reading that this issue doesn’t affect normal usage, construction or saving of DESeqDataSets; it only arises when developers construct one inside their own functions.

The solutions are:

1) Since version 1.18, you can just provide a matrix to design. This should solve your issue entirely.

2) You could avoid calling formula() or DESeqDataSet() within your function, and instead have the user call it from the global environment. This is what we do in DESeq2, which avoids the dds object that users create being bloated in size. The problem, again, arises only when you call formula() within a function, attach it to an object, then save it. This happens because of R's formula() function and can't easily be avoided.

3) You can delete everything from the environment inside the function using rm(), except the object itself, before return(). There is still some duplication because you can't delete the object itself, so the first two options are preferred.

4) You can also try doing what we do in makeExampleDESeqDataSet(), which is to force formula's environment to the global environment: https://github.com/mikelove/DESeq2/blob/600c6c20fca6c2d54148bea17ac31c424ac69336/R/core.R#L427-L431
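The four options can be compared side by side in base R. This is a rough sketch under stated assumptions: the build_* functions, ser_bytes, big and condition are hypothetical names, a plain list stands in for the DESeqDataSet, and the script is assumed to run at top level. For option 1, the matrix built with model.matrix() is the kind of object that, per the answer above, can be given as the design since DESeq2 1.18.

```r
# Bytes that would be written for x, before any compression.
ser_bytes <- function(x) length(serialize(x, NULL))

# The problem: a formula created next to a large local object.
build_bloated <- function() {
  big <- rnorm(1e6)
  list(design = ~ condition)
}

# Option 1: store a design matrix, which carries no environment.
build_with_matrix <- function(coldata) {
  big <- rnorm(1e6)
  list(design = model.matrix(~ condition, coldata))
}

# Option 2: the caller supplies the formula, so it carries the caller's
# environment rather than the builder's.
build_with_user_design <- function(design) {
  big <- rnorm(1e6)
  list(design = design)
}

# Option 3: remove large locals before returning; `obj` itself must stay.
build_with_rm <- function() {
  big <- rnorm(1e6)
  obj <- list(design = ~ condition)
  rm(big)
  obj
}

# Option 4: point the formula at the global environment.
build_with_globalenv <- function() {
  big <- rnorm(1e6)
  f <- ~ condition
  environment(f) <- globalenv()
  list(design = f)
}

coldata <- data.frame(condition = factor(c("A", "A", "B", "B")))
user_design <- ~ condition   # created by the "user" at top level

sizes <- c(
  bloated      = ser_bytes(build_bloated()),
  matrix       = ser_bytes(build_with_matrix(coldata)),
  user_formula = ser_bytes(build_with_user_design(user_design)),
  rm_locals    = ser_bytes(build_with_rm()),
  global_env   = ser_bytes(build_with_globalenv())
)
print(sizes)
```

For an object that is already bloated, such as dds_small in the question, the analogous step to option 4 would presumably be to reset the environment of the stored formula, e.g. environment(dds_small@design) <- globalenv(), before saving. This is an assumption and has not been checked against DESeq2 1.20.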


eric.blanc replied (13 months ago):

Thanks Michael, and sorry I wasn't able to find the relevant thread...

Michael Love replied (13 months ago):

I have a hard time finding old threads myself! And (1) is new within the last two versions; I added it because I got tired of dealing with formula() and its greedy behavior.