Question

DESeq2 rlog function takes too long

bharata1803 ▴ 60

@bharata1803-7698

Japan

Hello,

I have a fairly large read count matrix from TCGA: 577 samples and 18,522 genes. Running DESeq2 to calculate log fold changes did not take that long, around 3-4 hours. After that, I wanted to use the rlog function to get log-transformed gene expression, but it ran for almost 24 hours without finishing. I cancelled it because I thought something had gone wrong. I have an Intel® Core™ i7 CPU 975 @ 3.33GHz × 8 with 24 GB of RAM. As far as I know, R cannot use multiple cores to run DESeq2. Is there any suggestion for how to optimize this process?

deseq2 • 8.2k views

updated 10.0 years ago by Joseph Bundy ▴ 20 • written 10.0 years ago by bharata1803 ▴ 60

score 5 · Answer 1 · 2016-01-19

In the vignette and the workflow, I suggest using the VST instead of the rlog when there are hundreds of samples:

Note on running time: if you have many samples (e.g. 100s), the rlog function might take too long, and the variance stabilizing transformation might be a better choice. The rlog and VST have similar properties, but the rlog requires fitting a shrinkage term for each sample and each gene which takes time.

EDIT (Oct 2017): the code snippet below is no longer necessary, as the speedup is implemented in the function vst(), since DESeq2 version 1.12.

In addition to this suggestion, here is a snippet of code to speed up the VST even more.

I keep planning to add this to DESeq2 as a proper function, but haven't done so yet.
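As the EDIT above notes, vst() now carries the speedup itself. A minimal sketch of calling it follows, assuming DESeq2 1.12 or later is installed. The helper name fast_vst_matrix, the simulated data set, and the nsub value are illustrative assumptions, not part of the original answer.

```r
# Hypothetical helper (not part of DESeq2): run the fast VST and
# return the transformed values as a plain genes-by-samples matrix.
fast_vst_matrix <- function(dds, nsub = 1000) {
  # vst() estimates the dispersion trend on a subset of `nsub` genes,
  # which is what keeps it fast with hundreds of samples.
  vsd <- DESeq2::vst(dds, blind = FALSE, nsub = nsub)
  SummarizedExperiment::assay(vsd)
}

# Only run the demonstration when DESeq2 is available.
if (requireNamespace("DESeq2", quietly = TRUE)) {
  set.seed(1)
  # Simulated stand-in for a real count matrix (2000 genes, 12 samples).
  dds <- DESeq2::makeExampleDESeqDataSet(n = 2000, m = 12)
  mat <- fast_vst_matrix(dds, nsub = 500)
  print(dim(mat))
}
```

For a real data set, dds would be the DESeqDataSet built from the TCGA counts, and the default nsub = 1000 is usually a reasonable starting point.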

score 2 · Answer 2 · 2016-01-19

Gordon Smyth 53k

@gordon-smyth

WEHI, Melbourne, Australia

You probably already know this, but the rpkm() or cpm() functions in the edgeR package compute log transformed gene expression very quickly. These compute a simple but effective regularized log transformation.
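A minimal sketch of the cpm() route, assuming edgeR is installed. The wrapper name log_cpm_matrix and the simulated counts are illustrative assumptions; prior.count = 2 is simply passed explicitly here so the regularization is visible.

```r
# Hypothetical wrapper (not part of edgeR): log2 counts-per-million
# with a prior count that damps the variance of low counts.
log_cpm_matrix <- function(counts, prior.count = 2) {
  y <- edgeR::DGEList(counts = counts)
  y <- edgeR::calcNormFactors(y)  # TMM normalization factors
  edgeR::cpm(y, log = TRUE, prior.count = prior.count)
}

# Only run the demonstration when edgeR is available.
if (requireNamespace("edgeR", quietly = TRUE)) {
  set.seed(1)
  # Simulated stand-in for a real count matrix (200 genes, 6 samples).
  counts <- matrix(rnbinom(1200, mu = 100, size = 5), nrow = 200,
                   dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))
  logcpm <- log_cpm_matrix(counts)
  print(dim(logcpm))
}
```

rpkm() is called in a similar way but additionally needs gene lengths, which are not simulated here.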

written 10.0 years ago by Gordon Smyth 53k

Thanks for the suggestion. This is the first time I have worked with this much data. I will try your suggestion.

written 10.0 years ago by bharata1803 ▴ 60

For the purpose of leaving breadcrumbs, a similar function in DESeq2 is normTransform, which divides out library size factors, adds a pseudocount and applies a log2 transform. This was added when plotPCA was added to BiocGenerics, so that DESeq2::plotPCA could easily be run on a matrix of log-normalized counts, for comparing various transformation options.
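The three steps described above can be sketched in base R. This is a hand-rolled illustration, not DESeq2's own code: it assumes median-of-ratios size factors and a pseudocount of 1, and the simulated counts and variable names are made up for the example.

```r
set.seed(1)
# Simulated stand-in for a real count matrix (100 genes, 6 samples).
counts <- matrix(rnbinom(600, mu = 50, size = 2), nrow = 100,
                 dimnames = list(paste0("gene", 1:100), paste0("s", 1:6)))

# Step 1: median-of-ratios size factors. Genes with a zero in any
# sample have a non-finite log geometric mean and are skipped.
log_geo_means <- rowMeans(log(counts))
size_factors <- apply(counts, 2, function(x) {
  keep <- is.finite(log_geo_means) & x > 0
  exp(median(log(x[keep]) - log_geo_means[keep]))
})

# Steps 2 and 3: divide each column by its size factor,
# add a pseudocount of 1, and take log2.
log_norm <- log2(sweep(counts, 2, size_factors, "/") + 1)

print(round(size_factors, 3))
print(dim(log_norm))
```

With a DESeqDataSet dds in hand, assay(normTransform(dds)) should produce the same kind of matrix without the manual steps.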

written 10.0 years ago by Michael Love 43k

score 0 · Answer 3 · 2016-01-20

Hi there,

I've been encountering similar problems with long wait times on certain R functions (especially those in DEXSeq and WGCNA), and I have only 60 samples. If waiting around on R is a problem you face often, I would give the Intel MKL libraries a look, as discussed here: http://brettklamer.com/diversions/statistical/faster-blas-in-r/ They speed up certain calculations and allow some calculations in R to use multiple cores.

The easiest way to get the libraries is to simply download Revolution R (which is free, and automatically recognized by RStudio):
https://mran.revolutionanalytics.com/download/#download

I gave it a try at my PI's suggestion, and it has cut down some of the analysis times considerably. Just make sure you install both Revolution R AND the MKL library. To be clear, as I realize I sound a bit like a salesman, I am not an employee of Revolution Analytics. I just downloaded and used their library because it was advertised as doing mathematical calculations more efficiently and enabling multi-threaded calculations (which I have confirmed by watching the task manager).
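To check which BLAS an R session is actually using and get a rough feel for its speed, a small sketch like the following may help. It assumes R 3.4 or later for the BLAS field of sessionInfo() (older versions return NULL there), and the 500 by 500 matrix size is an arbitrary choice.

```r
# Path of the BLAS library R is linked against (NULL on older R versions).
blas_path <- sessionInfo()$BLAS
print(blas_path)

# Number of logical cores R can see (may be NA on some platforms).
n_cores <- parallel::detectCores()
print(n_cores)

# Time a matrix cross-product, an operation that is handed to the BLAS.
set.seed(1)
m <- matrix(rnorm(500 * 500), nrow = 500)
elapsed <- system.time(xtx <- crossprod(m))[["elapsed"]]
print(elapsed)
```

Running the same timing before and after switching libraries gives a crude comparison. Only operations that go through the BLAS benefit, so the effect on any particular package function will vary.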

Unfortunately, the MKL libraries aren't going to help you with your memory (RAM) management, which I suspect is why you're getting an error when doing the rlog transformation. Could you give more information about the error? If you already have one 577 by 18,522 matrix in the R workspace, I can't imagine that you have much room for another one. Monitor your memory usage in the task manager the next time you try the transformation and see if it's at capacity.

If it is indeed at capacity, you can try to better manage which objects you keep in the R environment with the rm() and gc() functions. rm() removes an object, which you specify by name, from the R environment, and gc() triggers garbage collection, which can release unused memory back to the operating system for subsequent calculations. You might also go through your code and make sure that you're not generating too many redundant objects to begin with (if you're like me, you have a lot of them). My current Windows installation has 128GB of RAM, and even with all that I've still had to remove certain objects to make room for others (which is admittedly mostly due to my sloppy programming and not the system's fault).
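A minimal sketch of the size check and the rm() and gc() steps, in base R only. The object name big and its dimensions are placeholders, and the byte count covers just the raw cells of a numeric matrix, not the intermediate objects a transformation may allocate.

```r
# Rough size of one 577 x 18,522 matrix stored as doubles (8 bytes per cell).
n_cells <- 577 * 18522
approx_mib <- n_cells * 8 / 1024^2
print(approx_mib)

# Inspect the footprint of an existing object (placeholder matrix here).
big <- matrix(0, nrow = 1000, ncol = 1000)
print(object.size(big), units = "MB")

# Remove the object by name, then trigger garbage collection.
rm(big)
invisible(gc())
```

Calling object.size() on each large object in the workspace is a quick way to find the ones worth removing.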

If you still don't have the RAM to run your analysis, I'd recommend simply installing more if your board will support it.