Question: DESeq2 Large Memory Usage When Run in Parallel
Jeff Granja (Stanford) wrote, 27 days ago:

I have a question about DESeq2's memory handling. I have come to really enjoy DESeq2 for ATAC-seq, and it is very well documented. The problem is that the count matrices can have a very large number of rows, which slows DESeq2 down considerably; this is why the parallel support is so useful. I was curious whether there is, or has been, any push to improve memory handling in the parallel code path, mainly when running the main DESeq() pipeline with parallel=TRUE. When I do this, memory usage spikes and is multiplied by the number of workers (roughly n_workers * memory), which can get out of hand (I have used 300 GB of RAM to finish a DESeq run). Would there be any potential for adding support for bigmemory matrices so that the memory footprint is reduced? Also, is there a way for the parallel workers to not copy the data for every thread? Thanks.
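To illustrate the kind of sharing I mean: with a file-backed bigmemory matrix, workers can attach to one shared backing file instead of each receiving its own copy. This is only a conceptual sketch, not something DESeq2 supports today; counts_matrix and the per-worker computation are placeholders:

library(bigmemory)
library(BiocParallel)
# counts_matrix stands in for the large count matrix
bm <- as.big.matrix(counts_matrix,
                    backingfile = "counts.bin",
                    descriptorfile = "counts.desc")
descfile <- file.path(getwd(), "counts.desc")
# each worker attaches to the shared backing file rather than
# receiving its own in-memory copy of the matrix
res <- bplapply(1:8, function(i, descfile) {
  library(bigmemory)
  m <- attach.big.matrix(descfile)
  sum(m[, i])  # placeholder computation on the shared matrix
}, descfile = descfile, BPPARAM = SnowParam(8))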


What operating system / platform are you on?

written 27 days ago by Martin Morgan

The operating system is Ubuntu 14.04.1 LTS.

written 26 days ago by Jeff Granja

Can you give more details, e.g. the object.size() of 'dds' and how you are using BiocParallel?

written 27 days ago by Michael Love

Hi,

I may have been mistaken about the object itself being large; the object.size() of dds is 24554008 bytes. The way I run BiocParallel is (threads is normally set to 16-24):

library(BiocParallel)
register(MulticoreParam(threads))  # threads is normally 16-24
print("Garbage Collecting")
gc()
dds <- DESeq(dds, parallel = TRUE, modelMatrixType = "expanded", betaPrior = TRUE)

The counts matrix has around 400K rows and ~100 samples. I supply custom normalization factors as well.
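(A simplified sketch of how I supply them; normFactors stands in for my precomputed matrix, with the same dimensions as the counts:)

# normFactors: placeholder for a precomputed matrix of gene- and
# sample-specific normalization factors, same dimensions as counts(dds)
normalizationFactors(dds) <- normFactors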

Thanks!

modified 26 days ago • written 26 days ago by Jeff Granja

How many threads?

written 26 days ago by Michael Love

Normally 16-24

written 26 days ago by Jeff Granja

For debugging, can you try a subset, e.g. 50k rows with, say, 4 cores, and see how long that takes and how much memory it uses?
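For example, something along these lines (the row range and core count here are arbitrary):

library(BiocParallel)
dds.sub <- dds[1:50000, ]   # arbitrary 50k-row subset
system.time(
  dds.sub <- DESeq(dds.sub, parallel=TRUE, BPPARAM=MulticoreParam(4))
)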

written 26 days ago by Michael Love

I have been doing approximately that as we speak. With 8 cores and ~60k rows (after thresholding) it took around 8 hours, whereas 12 cores with ~400k rows took about 24 hours (with ~300 GB of RAM usage). The memory usage scales with the size of the counts: 8 cores / 60k rows used 3-4 GB per thread, while 12 cores / ~400k rows used about 30 GB per thread. This made me believe that the matrix memory is the core issue (maybe it is getting copied for each thread, so the RAM spikes). The speed seems to scale appropriately. Let me know if you want any more tests.

A further question: is there any way to add an option to save progress during the main DESeq() run? I have had bplapply hit the memory limit towards the very end, which loses a lot of the work. Alternatively, is there an easy way to execute the steps line by line with parallel options, maybe documented in a vignette?

Thanks!

modified 26 days ago • written 26 days ago by Jeff Granja

Can you explain a bit about the design? This seems like way too long even for a single core, unless you have some very wide design matrix with hundreds of coefficients to fit. I think there’s some issue. I’ll try to replicate something like a problem of your dimension and report what I get on my machines.

written 26 days ago by Michael Love

Sure. The datasets that cause the main problems definitely have lots of coefficients (50+). For this design in particular we have many cell types from multiple donors, and the questions we are interested in are the differences between the cell types as a whole and across individuals.

For each individual we have replicates, which leads to a simple design: each sample is identified as Patient_Type_Rep, there is a contrast column (Patient_Type), and the design is ~contrast in this case.


modified 26 days ago • written 26 days ago by Jeff Granja

Can you give me a ballpark on the number of patients and types? I'll try it with mock data on my end and report timings etc.

written 26 days ago by Michael Love

In this case there are about ~25 types and ~3 patients (~75 Patient_Type combinations) with 2+ replicates each, so ~150 samples in total.

written 26 days ago by Jeff Granja
Michael Love (United States) answered, 25 days ago:

hi Jeff,

I tried the following on my cluster here, using either 4 or 8 cores at a time:

library(DESeq2)
library(BiocParallel)
df <- expand.grid(type=factor(1:25),patient=factor(1:3),rep=1:2)
dds <- makeExampleDESeqDataSet(n=100000,m=150)
colData(dds) <- DataFrame(df)
# ~~~ ...then either this simple design ~~~ 
design(dds) <- ~patient + type
dds <- DESeq(dds, parallel=TRUE, BPPARAM=MulticoreParam(4))
# ~~~ ...or this with patient-specific type effects ~~~
dds$group <- factor(paste0(dds$patient, "_", dds$type))
design(dds) <- ~group
dds <- DESeq(dds, parallel=TRUE, BPPARAM=MulticoreParam(4))

The max memory for 4 cores was 6 GB and the max for 8 cores was 10 GB. Our cluster uses SLURM and I requested span[hosts=1], which "indicates that all the processors allocated to this job must be on the same host".

By the way, the reason the memory gets into the low GBs is that there are many coefficients and statistics being stored within 'dds', beyond the counts and normalization factors, and I think it's hard to restrict which R objects get sent to the child processes. You could be more explicit in your scripts about breaking the dataset into discrete chunks: only the dispersion fit and prior variance steps need to look across all genes (and these are fast), so the other steps could run as totally independent jobs. To be certain that objects aren't shared, you'd have to do the splitting manually, as in the code in my follow-up below. I can help if you want to go this route. But my numbers below indicate that you might be better off just lowering the number of cores and maybe filtering the rows a bit if possible.

My timing (I just tried once for each) was 1400 s for simple design / 4 cores, 1100 s for simple design / 8 cores, 2800 s for patient-specific design / 4 cores, 1200 s for patient-specific design / 8 cores. So I would expect ~4x these numbers for memory and timing when you go from 100k to 400k rows.

Do you really have 400k rows, each with at least minimal signal across a minimum number of samples? Some modest filtering (e.g. at least 4 samples with normalized counts >= 10) might help bring things to a more manageable state, where it's just taking 20 minutes to run and using only a few GB.
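For example, a filter along those lines could look like this (assuming size factors or normalization factors have already been estimated, so normalized counts are available):

# keep rows with normalized counts >= 10 in at least 4 samples
keep <- rowSums(counts(dds, normalized=TRUE) >= 10) >= 4
dds <- dds[keep, ]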

After saving 'dds' at the completion of the code above, you'll still have to call results() to extract comparisons, and again you'll want to use parallel=TRUE. I would recommend running the above code with version >= 1.16, and then using lfcShrink() to get shrunken LFCs for a given 'contrast':

res <- results(dds, contrast=c("group","...","..."), parallel=TRUE)
res <- lfcShrink(dds, contrast=c("group","...","..."), res=res)

In the upcoming release of DESeq2 (v1.18 which will come out at the end of the month), you could replace the two lines above with the following faster line of code, which would build a results table with shrunken LFC for the contrast of interest:

res <- lfcShrink(dds, contrast=c("group","...","..."), parallel=TRUE)
written 25 days ago by Michael Love

hi Jeff,

I've been looking into this over the past few weeks, and I did time trials for 400k rows. For the complex design above, with 150 samples and patient-specific effects for each cell type, the whole pipeline took 1:15 of wall-clock time and used a max of 13 GB across 8 cores. That seems reasonable: each core using ~1.5 GB.

I found that I could reduce max memory by not using the DESeq() function, but instead breaking out the steps one by one. This is unfortunate, but I think unavoidable given how memory is shared when the parallel functions are called within a function (DESeq(), and its parallel sub-function) by the general-purpose backends 'parallel' and 'snow'. The run time was about the same as for DESeq() with parallel=TRUE, but there the max memory was 23 GB.

The code I used is pasted below. I think I'll rework the code in the DESeq2 functions so that users can copy and paste this from the source if it's useful; it's essentially what goes on when you call DESeq() with parallel=TRUE. But I think breaking this code out only gives a measurable benefit if you have 100,000s of rows and a number of coefficients approaching 100.

library(DESeq2)
library(BiocParallel)
dds <- estimateSizeFactors(dds)
# split the rows into 8 chunks, one per worker
idx <- factor(sort(rep(1:8, length=nrow(dds))))
dds.list <- lapply(1:8, function(i) dds[idx==i,])
# gene-wise dispersion estimates, run independently per chunk
dds.list <- bplapply(dds.list, function(d) {
  suppressPackageStartupMessages(library(DESeq2))
  estimateDispersionsGeneEst(d)
}, BPPARAM=SnowParam(8))
dds <- do.call(rbind, dds.list)
# these two steps need to see all genes, but they are fast
dds <- estimateDispersionsFit(dds)
dispPriorVar <- estimateDispersionsPriorVar(dds)
# MAP dispersion estimates and the Wald test, again per chunk
dds.list <- lapply(1:8, function(i) dds[idx==i,])
dds.list <- bplapply(dds.list, function(d, dispPriorVar) {
  suppressPackageStartupMessages(library(DESeq2))
  d <- estimateDispersionsMAP(d, dispPriorVar=dispPriorVar)
  nbinomWaldTest(d)
}, dispPriorVar, BPPARAM=SnowParam(8))
dds <- do.call(rbind, dds.list)
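
On your earlier question about saving progress: with the steps broken out like this, you could also checkpoint between stages with saveRDS(), so a failure late in the run doesn't lose the earlier work. A rough sketch (the file names are placeholders):

# after the first bplapply/rbind (gene-wise dispersion estimates)
saveRDS(dds, "dds_1_geneEst.rds")
# ... after estimateDispersionsFit() and estimateDispersionsPriorVar() ...
saveRDS(dds, "dds_2_dispFit.rds")
# ... after the second bplapply (MAP dispersions + Wald test) and rbind ...
saveRDS(dds, "dds_3_final.rds")   # final object, ready for results()
# to resume after a failure, reload the last checkpoint:
dds <- readRDS("dds_2_dispFit.rds")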
written 5 days ago by Michael Love