Search
Question: Large disk images of DESeqDataSet / SummarizedExperiment objects when using the save function
2
gravatar for akaever
2.8 years ago by
akaever30
akaever30 wrote:

Hi,

I have encountered unexpected large disk images when exporting DESeqDataSet (subclass of SummarizedExperiment) objects using the save function. The size of the exported file is much larger than expected, which is a problem for large experiments. In the following example, a test object is created and exported. For comparison, also the actual count values (a matrix of integers) are extracted and saved.

In order to test compression, all count values are set to zero and the corresponding object and count matrix are exported once again.

The DESeqDataSet disk image (about 10MB) is much larger than the expected size (about 4MB) and the counts seem not to be compressed by means of the save function. In comparison, the exported count matrix is highly compressed (1.3 MB and 0.03 MB for zero count values). See the following code

 

library(DESeq2)
# DESeq2_1.6.3

# create example data set
dds <- makeExampleDESeqDataSet(n=10000, m=100)
# size should be about 4 MB (10000 x 100 x 4 Bytes (integers))
print(object.size(dds@assays$data@listData$counts))
# 4000200 bytes

# extract counts for comparison
countMat <- counts(dds)
# set counts to zero to test compression
zeroDds <- dds; counts(zeroDds)[] <- 0L
zeroCountMat <- counts(zeroDds)

# save to disk
save(dds, file="dds.RData")
save(countMat, file="countMat.RData")
save(zeroCountMat, file="zeroCountMat.RData")
save(zeroDds, file="zeroDds.RData")

# report file size
for(f in c("dds.RData", "zeroDds.RData", "countMat.RData", "zeroCountMat.RData"))
  cat(f, ": ", file.info(f)$size / 2^20, " MB\n", sep="")
# dds.RData: 10.47201 MB
# zeroDds.RData: 10.50408 MB
# countMat.RData: 1.269134 MB
# zeroCountMat.RData: 0.02977085 MB

 

Maybe, there is somewhere a conversion to double?

Thanks for answers

ADD COMMENTlink modified 2.8 years ago by Wolfgang Huber13k • written 2.8 years ago by akaever30

The large size comes from the 'design' slot

owd = setwd(tempdir())
sort(sapply(slotNames(dds), function(nm) {
    x = slot(dds, nm); save(x, file=nm); file.size(nm)
}))
setwd(owd)

## dispersionFunction           exptData            colData 
##                79                168                316 
##           rowData             assays             design 
##            206618            1316594           10771848

which I think captures the global environment(!).

ADD REPLYlink modified 2.7 years ago • written 2.8 years ago by Martin Morgan ♦♦ 20k
2
gravatar for Michael Love
2.8 years ago by
Michael Love14k
United States
Michael Love14k wrote:

Hi Alexander,

Thanks for the note.

I believe this is only an issue for the makeExampleDESeqDataSet function and not for the "real data" constructors DESeqDataSet*. So if you try this in the global environment with a counts matrix, or SummarizedExperiment or HTSeq input I don't see the same problem.

Within the makeExampleDESeqDataSet function, I had tried to use the following code to have it not store extra copies of all the objects with the design:

designFormula <- as.formula("~ condition",env=new.env())

But I should go back and make sure I've removed the objects, as this is apparently not working.

 

ADD COMMENTlink written 2.8 years ago by Michael Love14k

I think I was mislead when I was testing the above code by the following:

> dds <- makeExampleDESeqDataSet(n=10000,m=100)
> ls(environment(design(dds)))
character(0)

I'll go over the makeExampleDESeqDataSet code and test by saving.

ADD REPLYlink written 2.8 years ago by Michael Love14k

I believe I have solved this problem for makeExampleDESeqDataSet in version 1.7.26, by using:

design <- as.formula("~ condition", env=.GlobalEnv)

Now the design of an example dataset does not include all the other objects from within makeExampleDESeqDataSet(), nor does it include objects from the global environment when saved:

 big.vector <- seq_len(1e7)
 dds <- makeExampleDESeqDataSet(n=10000, m=100)
 sizes <- sapply(slotNames(dds), function(nm) { x = slot(dds, nm); save(x, file=nm); file.info(nm)$size })
 sort(sizes)
 dispersionFunction             design           exptData            colData
                 79                115                168                316
            rowData             assays
             206654            1312688
 sum(sizes)
 [1] 1520020
save(dds, file="object"); file.info("object")$size
 [1] 1519846

 

 

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Michael Love14k

Thanks for all the replies! It works :-), but I do not fully understand why objects in the .GlobalEnv are not automatically saved in this case. I would have tried

design <- as.formula("~ condition", env=emptyenv())

to avoid saving additional objects.

ADD REPLYlink written 2.7 years ago by akaever30
1

This would give an error later on, when trying to use the formula to build a matrix:

> model.matrix(as.formula("~x",env=emptyenv()), data=data.frame(x=c(0,0,1,1)))
 Error in eval(expr, envir, enclos) : could not find function "list"
 Calls: model.matrix ... model.matrix.default -> model.frame -> model.frame.default -> eval -> eval

 

ADD REPLYlink written 2.7 years ago by Michael Love14k
0
gravatar for Wolfgang Huber
2.8 years ago by
EMBL European Molecular Biology Laboratory
Wolfgang Huber13k wrote:

For the record, with Alexander's code from above and DESeq2_1.7.2 6, I now get:

dds.RData: 1.457016 MB
zeroDds.RData: 0.2411022 MB
countMat.RData: 1.270781 MB
zeroCountMat.RData: 0.02954865 MB

Can someone explain why in his initial post (with the previous DESeq2), zeroDds.RData was larger than dds.RData?

 

 

 

 

 

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Wolfgang Huber13k

I think dds assays do not have dimnames, whereas zeroDds do

> str(dds@assays@.xData$data@listData$counts)
 int [1:10000, 1:100] 44 313 34 3 1 3 34 8 11 14 ...
> str(zeroDds@assays@.xData$data@listData$counts)
 int [1:10000, 1:100] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:10000] "gene1" "gene2" "gene3" "gene4" ...
  ..$ : NULL

I think this is because of a (unintentional?) omission in the assays<-,SummarizedExperiment,SimpleList-method

> showMethods("assays<-", includeD=T)
Function: assays<- (package GenomicRanges)
x="SummarizedExperiment", value="list"
function (x, ..., withDimnames = TRUE, value) 
{
    value <- lapply(x, "dimnames<-", NULL)
    .SummarizedExperiment.assays.replace(x, ..., value = SimpleList(value))
}

x="SummarizedExperiment", value="SimpleList"
function (x, ..., withDimnames = TRUE, value) 
{
    x <- clone(x, ..., assays = value)
    msg <- .valid.SummarizedExperiment(x)
    if (!is.null(msg)) 
        stop(msg)
    x
}

I think the names should be stripped (SummarizedExperiment manages these internally).

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Martin Morgan ♦♦ 20k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 89 users visited in the last hour