Question: Large disk images of DESeqDataSet / SummarizedExperiment objects when using the save function
gravatar for akaever
3.7 years ago by
akaever30 wrote:


I have encountered unexpected large disk images when exporting DESeqDataSet (subclass of SummarizedExperiment) objects using the save function. The size of the exported file is much larger than expected, which is a problem for large experiments. In the following example, a test object is created and exported. For comparison, also the actual count values (a matrix of integers) are extracted and saved.

In order to test compression, all count values are set to zero and the corresponding object and count matrix are exported once again.

The DESeqDataSet disk image (about 10MB) is much larger than the expected size (about 4MB) and the counts seem not to be compressed by means of the save function. In comparison, the exported count matrix is highly compressed (1.3 MB and 0.03 MB for zero count values). See the following code


# DESeq2_1.6.3

# create example data set
dds <- makeExampleDESeqDataSet(n=10000, m=100)
# size should be about 4 MB (10000 x 100 x 4 Bytes (integers))
# 4000200 bytes

# extract counts for comparison
countMat <- counts(dds)
# set counts to zero to test compression
zeroDds <- dds; counts(zeroDds)[] <- 0L
zeroCountMat <- counts(zeroDds)

# save to disk
save(dds, file="dds.RData")
save(countMat, file="countMat.RData")
save(zeroCountMat, file="zeroCountMat.RData")
save(zeroDds, file="zeroDds.RData")

# report file size
for(f in c("dds.RData", "zeroDds.RData", "countMat.RData", "zeroCountMat.RData"))
  cat(f, ": ",$size / 2^20, " MB\n", sep="")
# dds.RData: 10.47201 MB
# zeroDds.RData: 10.50408 MB
# countMat.RData: 1.269134 MB
# zeroCountMat.RData: 0.02977085 MB


Maybe, there is somewhere a conversion to double?

Thanks for answers

ADD COMMENTlink modified 3.7 years ago by Wolfgang Huber13k • written 3.7 years ago by akaever30

The large size comes from the 'design' slot

owd = setwd(tempdir())
sort(sapply(slotNames(dds), function(nm) {
    x = slot(dds, nm); save(x, file=nm); file.size(nm)

## dispersionFunction           exptData            colData 
##                79                168                316 
##           rowData             assays             design 
##            206618            1316594           10771848

which I think captures the global environment(!).

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Martin Morgan ♦♦ 22k
gravatar for Michael Love
3.7 years ago by
Michael Love19k
United States
Michael Love19k wrote:

Hi Alexander,

Thanks for the note.

I believe this is only an issue for the makeExampleDESeqDataSet function and not for the "real data" constructors DESeqDataSet*. So if you try this in the global environment with a counts matrix, or SummarizedExperiment or HTSeq input I don't see the same problem.

Within the makeExampleDESeqDataSet function, I had tried to use the following code to have it not store extra copies of all the objects with the design:

designFormula <- as.formula("~ condition",env=new.env())

But I should go back and make sure I've removed the objects, as this is apparently not working.


ADD COMMENTlink written 3.7 years ago by Michael Love19k

I think I was mislead when I was testing the above code by the following:

> dds <- makeExampleDESeqDataSet(n=10000,m=100)
> ls(environment(design(dds)))

I'll go over the makeExampleDESeqDataSet code and test by saving.

ADD REPLYlink written 3.7 years ago by Michael Love19k

I believe I have solved this problem for makeExampleDESeqDataSet in version 1.7.26, by using:

design <- as.formula("~ condition", env=.GlobalEnv)

Now the design of an example dataset does not include all the other objects from within makeExampleDESeqDataSet(), nor does it include objects from the global environment when saved:

 big.vector <- seq_len(1e7)
 dds <- makeExampleDESeqDataSet(n=10000, m=100)
 sizes <- sapply(slotNames(dds), function(nm) { x = slot(dds, nm); save(x, file=nm);$size })
 dispersionFunction             design           exptData            colData
                 79                115                168                316
            rowData             assays
             206654            1312688
 [1] 1520020
save(dds, file="object");"object")$size
 [1] 1519846



ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Michael Love19k

Thanks for all the replies! It works :-), but I do not fully understand why objects in the .GlobalEnv are not automatically saved in this case. I would have tried

design <- as.formula("~ condition", env=emptyenv())

to avoid saving additional objects.

ADD REPLYlink written 3.7 years ago by akaever30

This would give an error later on, when trying to use the formula to build a matrix:

> model.matrix(as.formula("~x",env=emptyenv()), data=data.frame(x=c(0,0,1,1)))
 Error in eval(expr, envir, enclos) : could not find function "list"
 Calls: model.matrix ... model.matrix.default -> model.frame -> model.frame.default -> eval -> eval


ADD REPLYlink written 3.7 years ago by Michael Love19k
gravatar for Wolfgang Huber
3.7 years ago by
EMBL European Molecular Biology Laboratory
Wolfgang Huber13k wrote:

For the record, with Alexander's code from above and DESeq2_1.7.2 6, I now get:

dds.RData: 1.457016 MB
zeroDds.RData: 0.2411022 MB
countMat.RData: 1.270781 MB
zeroCountMat.RData: 0.02954865 MB

Can someone explain why in his initial post (with the previous DESeq2), zeroDds.RData was larger than dds.RData?






ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Wolfgang Huber13k

I think dds assays do not have dimnames, whereas zeroDds do

> str(dds@assays@.xData$data@listData$counts)
 int [1:10000, 1:100] 44 313 34 3 1 3 34 8 11 14 ...
> str(zeroDds@assays@.xData$data@listData$counts)
 int [1:10000, 1:100] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:10000] "gene1" "gene2" "gene3" "gene4" ...
  ..$ : NULL

I think this is because of a (unintentional?) omission in the assays<-,SummarizedExperiment,SimpleList-method

> showMethods("assays<-", includeD=T)
Function: assays<- (package GenomicRanges)
x="SummarizedExperiment", value="list"
function (x, ..., withDimnames = TRUE, value) 
    value <- lapply(x, "dimnames<-", NULL)
    .SummarizedExperiment.assays.replace(x, ..., value = SimpleList(value))

x="SummarizedExperiment", value="SimpleList"
function (x, ..., withDimnames = TRUE, value) 
    x <- clone(x, ..., assays = value)
    msg <- .valid.SummarizedExperiment(x)
    if (!is.null(msg)) 

I think the names should be stripped (SummarizedExperiment manages these internally).

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Martin Morgan ♦♦ 22k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 426 users visited in the last hour