Question: DESeq function taking too long
4 months ago, sup230 wrote:

Hi, 

I am running the DESeq function in DESeq2 (version 1.16.1) and it's taking too long. I had it running overnight and it's still not done.

I think it's because my data are too big, since I was able to run it with a smaller dataset. My count data consist of 56730 genes from 475 samples, and I reduced the metadata to include only 4 variables for the 475 samples. Previously I ran DESeq with 20000 genes and 280 samples and it took no more than 15 minutes. I am wondering if this is expected given the large size, or if there are other ways to make this function run faster. Based on a previous post (from about 3.3 years ago), I also tried converting all metadata columns to factors. Thanks in advance for your input!

 

dds.adj2 <- DESeqDataSetFromMatrix(countData = hg38.counts, colData = hg38.coremeta, design = ~ agequart + muse_IDH1_status + seizure_history)
vsd.adj2 <- vst(dds.adj2, blind = TRUE)
dds.adj2 <- estimateSizeFactors(dds.adj2)
dds.adj2 <- DESeq(dds.adj2)

 

> dds.adj2<-DESeq(dds.adj2)
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates

 

modified 4 months ago by Peter Langfelder • written 4 months ago by sup230

4 months ago, Michael Love (United States) wrote:
I'd recommend removing genes with very small counts, for example genes where no more than 3 or 5 samples have normalized counts larger than 10. Also, can you try running on a subset of genes, e.g. 500 or 1000, to give me a sense of the timing?
written 4 months ago by Michael Love
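
Editorial sketch of the timing test suggested above, assuming a random subset of 1000 genes is representative. The helper name `subset_genes` is hypothetical; the DESeq2 lines are left as comments because they need the `dds.adj2` object from the question. Whether run time scales linearly with the number of genes is an assumption to check, not something the thread confirms.

```r
# Hypothetical helper: pick n_subset distinct row indices out of n_total genes.
subset_genes <- function(n_total, n_subset, seed = 1) {
  set.seed(seed)                      # reproducible choice of genes
  sort(sample.int(n_total, n_subset)) # sampling without replacement
}

idx <- subset_genes(56730, 1000)

# With the DESeqDataSet from the question (requires DESeq2):
# dds.sub <- dds.adj2[idx, ]
# system.time(dds.sub <- DESeq(dds.sub))
```

If the subset finishes quickly, multiplying the elapsed time by roughly 57 (56730 / 1000) gives a rough first guess for the full run.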

Can you clarify how to remove genes with very small counts? Should I look at normalized counts (not the raw counts) and remove the genes with normalized counts less than 10 in most samples?

I have tried with 100 samples for all 56730 genes and it took about 15 min to run DESeq. Also, when I tried again with the entire count data, I got the error below. Thanks!

 

using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
Error: cannot allocate vector of size 205.6 Mb

 

written 4 months ago by sup230
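
Editorial note: the size in the error message matches one double-precision matrix with an entry for every gene and sample in this dataset. The arithmetic below shows that; which intermediate object DESeq2 was allocating at that point is not stated in the thread, so treat the interpretation as an assumption.

```r
n_genes   <- 56730
n_samples <- 475

# A numeric (double) value takes 8 bytes in R.
bytes   <- n_genes * n_samples * 8
size_mb <- bytes / 1024^2   # R reports sizes in units of 1024^2 bytes

round(size_mb, 1)           # about 205.6
```

Several objects of this size are typically alive at once during a fit, so on an 8 GB machine a failure to allocate one more is plausible.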
4 months ago, Peter Langfelder (United States) wrote:

Your "out of memory" error suggests your RAM is nearly exhausted. What system are you on? Can you monitor your use of system resources, especially whether your computer needs to swap to disk? It may be that you run out of physical RAM and the system starts swapping to disk, which causes the calculation to grind to a near halt.

written 4 months ago by Peter Langfelder

I am using Windows 10 Pro (v. 1703) 64 bit OS. I got this desktop about a month ago. 

It says that the installed RAM is 8GB and 7.89 GB is usable. I am not completely sure what you meant by swapping to disk..?

written 4 months ago by sup230

4 months ago, Peter Langfelder (United States) wrote:

Disclaimer: I am not a Windows user, so my advice may be off the mark, but here goes anyway.

You should be able to find that in the Resource Monitor, Memory tab. I think you should look at the "Hard faults per second" and the overall RAM utilization (see discussion here: https://superuser.com/questions/508448/how-to-view-windows-equivalent-of-unix-swap-usage). The hard faults measure how much the system swaps between memory and disk (see https://en.wikipedia.org/wiki/Page_fault). If the overall RAM utilization is near 100% and the number of hard faults for the R process is high, it likely means you are exceeding the physical RAM capacity.

written 4 months ago by Peter Langfelder

Thanks for the clarification! 

The Resource Monitor shows 79% used physical memory: 6393 MB in use and 1653 MB available. The number of hard faults listed under Processes is 0, but on the graph on the right side I see some peaks come and go. Any suggestions for a solution if this seems to be the cause of the problem?

 

written 4 months ago by sup230
I was suggesting to subset genes, not samples. You can use an index such as rowSums(counts(dds, normalized=TRUE) >= 10) >= 5, or fill in a reasonable value instead of 5. For me, datasets on the order of 400 samples can be computed in less than an hour with DESeq() using e.g. 4 cores with parallel=TRUE. But also, for experiments with ~100s of replicates, I tend to use limma-voom, which benefits from a closed-form solution, while the GLM in DESeq2 requires iterative convergence.
written 4 months ago by Michael Love
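
Editorial sketch of the filter described above, applied to a small made-up matrix so the logic can be checked by eye. The matrix `norm_counts`, its gene names, and the variables `min_count` and `min_samples` are hypothetical; the commented lines show how the same index would be used on a DESeqDataSet, which needs size factors before normalized counts can be extracted.

```r
# Hypothetical normalized counts: 4 genes (rows) x 8 samples (columns).
norm_counts <- rbind(
  geneA = c(50, 60, 40, 55, 70, 45, 65, 52),
  geneB = c( 0,  1,  0,  2,  0,  0,  1,  0),
  geneC = c(12,  0, 15, 11,  0, 13, 10,  0),
  geneD = c(11, 12,  0,  0,  0,  0,  0,  9)
)

min_count   <- 10  # a sample "counts" if its normalized count is >= 10
min_samples <- 5   # keep a gene if at least 5 samples reach min_count

# TRUE for genes to keep.
keep <- rowSums(norm_counts >= min_count) >= min_samples
filtered <- norm_counts[keep, , drop = FALSE]
rownames(filtered)  # "geneA" "geneC"

# On a DESeqDataSet (requires DESeq2):
# dds <- estimateSizeFactors(dds)
# keep <- rowSums(counts(dds, normalized = TRUE) >= 10) >= 5
# dds <- dds[keep, ]
```

geneC is kept because exactly five samples reach 10; geneD is dropped because only two do.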

Can you briefly elaborate on how DESeq2 and limma-voom differ as methods?

written 4 months ago by sup230

Methods tend to have large overlap as the sample size grows large. See for example Schurch 2016 or our DESeq2 paper. But limma-voom has a large speed advantage when you have 400+ samples as here.

written 4 months ago by Michael Love

Does this happen while you run DESeq on the large data that was causing you trouble? If so, it would seem that swapping is not a problem, in which case I defer further solutions to Michael Love.

written 4 months ago by Peter Langfelder