DESEq2 Large Memory When Parallel
1
1
Entering edit mode
Jeff Granja ▴ 10
@jeff-granja-13923
Last seen 6.6 years ago
Stanford

I have a direct question about DESeq2's memory handling. I have come to really enjoy DESeq2 for ATAC-Seq and it is really well documented etc. The problem is matrices can get quite large in rows which slows down DESeq a lot. This is why the parallelism of DESeq is quite useful. I was curious if there is/was a push for maybe coming up with a way to deal with the large memory handling when it comes to the parallel extensions. Mainly when running the main DESeq pipeline with parallel = TRUE. When this occurs, the memory spikes and it multiplies the memory usage (n * memory) and this can get quite crazy (i have used 300 GB of RAM to finish a DESeq run). I was wondering if ever there would be the potential of adding capabilities with bigmemory matrices so that the memory is reduced. Also if there was a way for the parallelism to not copy the data for every thread. Thanks.

deseq2 • 4.7k views
ADD COMMENT
0
Entering edit mode

What platform is your operating system?

ADD REPLY
0
Entering edit mode

The operating system is Ubuntu 14.04.1 LTS.

ADD REPLY
0
Entering edit mode

Can you give more details, eg the object.size() of ‘dds’ and how you are using BiocParallel?

ADD REPLY
0
Entering edit mode

Hi,

I guess I mistook maybe that the size of the object being big itself but the object.size of dds is 24554008 bytes, the way i run BiocParallel is (threads i normally set to 16-24)

library(BiocParallel)

register(MulticoreParam(threads)) #threads normally 16-24

print("Garbage Collecting")

gc()

dds <- DESeq(dds ,parallel =TRUE, modelMatrixType = "expanded", betaPrior = TRUE)

Which the dimension of the counts matrix is around 400K rows and ~100 samples. I supply custom normalization factors as well.

Thanks!

ADD REPLY
0
Entering edit mode

How many threads?

ADD REPLY
0
Entering edit mode

Normally 16-24

ADD REPLY
0
Entering edit mode

For debugging, can you try a subset, eg 50k rows, with say 4 cores, and see how long that takes and how much memory?

ADD REPLY
0
Entering edit mode

I have been doing approximately that right as we speak. I have been using 8 cores ~60k rows (thresholding) and it took around ~8 hours when 12 cores ~400k rows was about ~24 hours (with ~300 GB of RAM usage). The memory usage scales with the size of the counts 8 cores 60k rows uses 3-4GB per thread while when i ran 12 cores ~400k rows was about 30GB per thread. This made me believe that the matrix memory is the core issue (maybe its getting copied for each thread so the RAM spikes). The speed seems to scale appropriately. Let me know if you want any more tests.

A further question i have is there anyway to maybe allow an option to save progress during DESeq main part? I have had the memory max for bpapply towards the very end which loses a lot of the work. Or an easy way to execute line by line with parallel options maybe a vignette? 

Thanks!

ADD REPLY
0
Entering edit mode

Can you explain a bit about the design? This seems like way too long even for a single core, unless you have some very wide design matrix with hundreds of coefficients to fit. I think there’s some issue. I’ll try to replicate something like a problem of your dimension and report what I get on my machines.

ADD REPLY
0
Entering edit mode

Sure. So the datasets that cause the main problems definitely have lots of coefficients (50+). For this design in particular we have lots of cell types from mulitple donors and the questions we are interested in is the difference between the cell types as a whole and across individuals.

For each individual we have replicates and this leads to a simple design where it is (sorry i am not sure how to properly put this in) sample (Patient_Type_Rep) and a contrast column (Patient_Type) and the design is ~contrast in this case. 

Edit(limited user so i cannot post below for a few hours) In this case there is about ~25 Types with ~3 Patients (~75 Patient_Type) with 2+ replicates so in total ~150 samples.

ADD REPLY
0
Entering edit mode

Can you give me a ballpark on number of patient and type? I’ll try it with mock data on my end and report timings etc.

ADD REPLY
0
Entering edit mode

In this case there is about ~25 Types with ~3 Patients (~75 Patient_Type) with 2+ replicates so in total ~150 samples.

ADD REPLY
0
Entering edit mode
@mikelove
Last seen 1 hour ago
United States

hi Jeff,

I tried the following on my cluster here, using either 4 or 8 cores at a time:

library(DESeq2)
library(BiocParallel)
df <- expand.grid(type=factor(1:25),patient=factor(1:3),rep=1:2)
dds <- makeExampleDESeqDataSet(n=100000,m=150)
colData(dds) <- DataFrame(df)
# ~~~ ...then either this simple design ~~~ 
design(dds) <- ~patient + type
dds <- DESeq(dds, parallel=TRUE, BPPARAM=MulticoreParam(4))
# ~~~ ...or this with patient-specific type effects ~~~
dds$group <- factor(paste0(dds$patient, "_", dds$type))
design(dds) <- ~group
dds <- DESeq(dds, parallel=TRUE, BPPARAM=MulticoreParam(4))

The max memory for 4 cores was 6 GB and the max for 8 cores was 10 GB. Our cluster uses SLURM and I requested span[hosts=1] which "indicates that all the processors allocated to this job must be on the same host". By the way, the reason why the memory gets into the low GBs is because there are many coefficients and statistics being stored within 'dds', beyond the counts and normalization factors, and I think it's hard to restrict what R objects are being sent around to child processes. You could be more explicit in the scripts about breaking apart the dataset into discrete chunks, for example, only the dispersion fit and prior variance steps need to look across all genes (these are fast), and the other steps could occur in totally independent jobs. I think to be certain that objects wouldn't be shared, you'd have to do the splitting manually. I can help if you want to do this route. But my numbers below indicate that you might be better off just lowering the number of cores and maybe filtering the rows a bit if possible.

My timing (I just tried once for each) was 1400 s for simple design / 4 cores, 1100 s for simple design / 8 cores, 2800 s for patient-specific design / 4 cores, 1200 s for patient-specific design / 8 cores. So I would expect ~4x these numbers for memory and timing when you go from 100k to 400k rows.

Do you really have 400k rows with minimal signal across a minimum number of samples? Some modest filtering (e.g. >= 4 samples with normalized counts >= 10) might help brings things to a more manageable state, where it's just taking 20 min to run and using only a few GB.

After saving 'dds' at the completion of the code above, you'll still have to call results() to extract comparisons, and again you'll want to use parallel=TRUE. I would recommend using the above code with version >= 1.16, and then use lfcShrink() to get shrunken LFC for a given 'contrast':

res <- results(dds, contrast=c("group","...","..."), parallel=TRUE)
res <- lfcShrink(dds, contrast=c("group","...","..."), res=res)

In the upcoming release of DESeq2 (v1.18 which will come out at the end of the month), you could replace the two lines above with the following faster line of code, which would build a results table with shrunken LFC for the contrast of interest:

res <- lfcShrink(dds, contrast=c("group","...","..."), parallel=TRUE)
ADD COMMENT
0
Entering edit mode

hi Jeff,

I've been looking into this for the past weeks, and I did time trials for 400k rows. For the complex design above with 150 samples and patient-specific effects for each cell type, the whole pipeline took 1:15 on the clock and used a max of 13 GB, across 8 cores. That seems reasonable, each core using ~1.5 GB.

I found that I could reduce max memory by not using the DESeq() function, but breaking out the steps one by one. This is unfortunate, but I think unavoidable due to the nature of memory sharing when the parallel functions are called within a function (DESeq, and its parallel sub-function) by the general purpose backends 'parallel' and 'snow'. The run time was about the same for using DESeq() with parallel=TRUE but the max memory was 23 GB.

The code I used is pasted below. I think I'll work on the code in DESeq2 functions so that users can copy paste this from the source if it's useful. It's essentially what goes on when you call DESeq() with parallel=TRUE. But I think breaking this code out is only going to give a measurable benefit if you have 100,000s of rows and number of coefficients getting close to 100.

library(BiocParallel)
dds <- estimateSizeFactors(dds)
idx <- factor(sort(rep(1:8,length=nrow(dds))))
dds.list <- lapply(1:8, function(i) dds[idx==i,])
dds.list <- bplapply(dds.list, function(d) {
  suppressPackageStartupMessages(library(DESeq2))
  estimateDispersionsGeneEst(d)
}, BPPARAM=SnowParam(8))
dds <- do.call(rbind, dds.list)
dds <- estimateDispersionsFit(dds)
dispPriorVar <- estimateDispersionsPriorVar(dds)
dds.list <- lapply(1:8, function(i) dds[idx==i,])
dds.list <- bplapply(dds.list, function(d, dispPriorVar) {
  suppressPackageStartupMessages(library(DESeq2))
  d <- estimateDispersionsMAP(d, dispPriorVar=dispPriorVar)
  nbinomWaldTest(d)
}, dispPriorVar, BPPARAM=SnowParam(8))
dds <- do.call(rbind, dds.list)
ADD REPLY
0
Entering edit mode

Hi, 

My matrix has 10k rows and 1k columns. With parallel=F the memory use remains below 20GB but it takes too long to run, with parallel=T and 16 cores the memory use goes up to >300GB and with 8 cores >150GB...Is this expected? 

ADD REPLY
0
Entering edit mode

Yes some growth in the memory use is expected because R ends up duplicating the memory. I’ve tried to see how it can be limited but it seems to be a limitation of the parallel backends, not something I could fix from DESeq2. The best you  can do is clean up the environment as much as possible before running.

Note that limma voom is a much faster approach when you have hundreds of samples. I myself use limma in these situations often.

ADD REPLY
0
Entering edit mode

Yes, I have come to the same conclusion. I will try with limma-voom. Thanks! 

ADD REPLY
0
Entering edit mode

Hi Michael,

So it's possible to utilize SLURM to perform RNA-seq analysis in the cloud/HPC? Since I've used DESeq2 in the past on a single machine, I was wondering if there is any chance you might share a bit more details in how you've set this up at your lab?

Thank you! Mara

ADD REPLY
0
Entering edit mode

One detail is that DESeq2 has been made more efficient in the past releases. What version are you using? How many samples, and what is the design?

ADD REPLY
0
Entering edit mode

R version 4.0.3 DESeq2_1.28.1

I'll be trying to test with as big dataset as I can get from publicly available data and the design will probably be just dds$condition <- factor(dds$condition, levels = c("untreated","treated")) as per your tutorial (http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)

I'm trying to make a use case utilizing HPC/cloud computing and slurm to see if this is feasible to set up and test the time taken for the analysis on different cloud providers.

ADD REPLY
0
Entering edit mode

I think it really depends on specific use cases though.

Suppose you have 100 samples over 10,000 genes. Running DESeq2 here (I turn off outlier replacement for simplicity) takes 30 seconds on a laptop. So it's really not worth engaging with a cluster scheduler for that type of dataset. The overhead will be orders of magnitude longer than simply running on one machine.

> dds <- makeExampleDESeqDataSet(n=10000, m=100)
> system.time({ DESeq(dds, minRep=Inf) })
   user  system elapsed
 30.494   0.642  31.818

In the devel branch, Constantin has also added fast code for fitting single cell datasets with thousands of cells for example:

> system.time({ DESeq(dds, test="LRT", reduced=~1, fitType="glmGamPoi", minRep=Inf) })
using 'glmGamPoi' as fitType. If used in published research, please cite:
    Ahlmann-Eltze, C., Huber, W. (2020) glmGamPoi: Fitting Gamma-Poisson
    Generalized Linear Models on Single Cell Count Data. bioRxiv.
    https://doi.org/10.1101/2020.08.13.249623
   user  system elapsed
  8.560   0.499   9.160

I wouldn't bother with distributing jobs to a cluster unless there was a compelling use case. As the sample size grows, first I would recommend using glmGamPoi.

ADD REPLY

Login before adding your answer.

Traffic: 760 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6