DESeq function taking too long
3
0
sup230 ▴ 30
@sup230-13286
Last seen 7.1 years ago

Hi, 

I am running the DESeq() function in DESeq2 (version 1.16.1) and it is taking too long. I had it running overnight and it is still not done.

I am thinking it is because my data are too big, since I was able to run it with a smaller dataset. My count matrix has 56,730 genes from 475 samples, and I reduced the metadata to include only 4 variables for those 475 samples. Previously I had run DESeq with 20,000 genes and 280 samples, and it did not take more than 15 minutes. I am wondering whether this is expected given the large size, or whether there are other ways to make this function run faster. Based on a previous post (from about 3.3 years ago), I also tried converting all metadata columns to factors. Thanks in advance for your input!

 

library(DESeq2)

dds.adj2 <- DESeqDataSetFromMatrix(countData = hg38.counts, colData = hg38.coremeta, design = ~ agequart + muse_IDH1_status + seizure_history)
vsd.adj2 <- vst(dds.adj2, blind = TRUE)
dds.adj2 <- estimateSizeFactors(dds.adj2)
dds.adj2 <- DESeq(dds.adj2)

 

> dds.adj2<-DESeq(dds.adj2)
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates

 

deseq2 deseq factor rnaseq • 3.9k views
ADD COMMENT
0
@mikelove
Last seen 4 days ago
United States
I'd recommend removing genes with very small counts, for example genes where no more than 3 or 5 samples have normalized counts larger than 10. Also, can you try DESeq() on a subset of genes, e.g. 500 or 1000, to give me a sense of the timing?
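For the timing test, a minimal sketch along these lines should work (assuming dds.adj2 is the DESeqDataSet from the original post; the 1,000-gene subset and the seed are arbitrary choices):

# time DESeq() on a random subset of 1,000 genes to gauge how long
# the full 56,730-gene run might take
set.seed(1)
idx <- sample(nrow(dds.adj2), 1000)
system.time(dds.sub <- DESeq(dds.adj2[idx, ]))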
ADD COMMENT
0

Can you clarify how to remove genes with very small counts? Should I look at normalized counts (not the raw counts?) and remove the genes with normalized counts less than 10 in most samples?

I have tried with 100 samples for all 56,730 genes and it took about 15 minutes to run DESeq. Also, when I tried again with the entire count matrix, I got the error below. Thanks!

 

using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
Error: cannot allocate vector of size 205.6 Mb

 

ADD REPLY
0
@peter-langfelder-4469
Last seen 5 weeks ago
United States

Your "out of memory" error would suggest your RAM nearly exhausted.  What system are you on? Can you monitor your use of system resources, especially whether your computer needs to use swapping to disk? It may be that you run of physical RAM and the system starts swapping to disk, which causes the calculation to grind to a near halt.

ADD COMMENT
0

I am using Windows 10 Pro (v. 1703), 64-bit OS. I got this desktop about a month ago.

It says that the installed RAM is 8 GB and 7.89 GB is usable. I am not completely sure what you meant by swapping to disk?

ADD REPLY
0
@peter-langfelder-4469
Last seen 5 weeks ago
United States

Disclaimer: I am not a Windows user, so my advice may be off the mark, but here goes anyway.

You should be able to find that in the Resource Monitor, Memory tab. I think you should look at the "Hard faults per second" and the overall RAM utilization (see discussion here: https://superuser.com/questions/508448/how-to-view-windows-equivalent-of-unix-swap-usage). The hard faults measure how much the system swaps between memory and disk (see https://en.wikipedia.org/wiki/Page_fault). If the overall RAM utilization is near 100% and the number of hard faults for the R process is high, it likely means you are exceeding the physical RAM capacity.
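If it helps, you can also get a rough picture from inside R. A small sketch (hg38.counts and dds.adj2 are the objects from the original post):

# approximate in-memory size of the count matrix and the DESeqDataSet
print(object.size(hg38.counts), units = "Mb")
print(object.size(dds.adj2), units = "Mb")

# report current R memory use; also frees unused memory
gc()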

ADD COMMENT
0

Thanks for the clarification! 

The Resource Monitor shows 79% used physical memory: 6393 MB in use and 1653 MB available. The number of hard faults listed under processes is 0, but on the graph to the right side I see some peaks come and go. Any suggestions for a solution if this seems to be the cause of the problem?

 

ADD REPLY
2

I was suggesting to subset genes, not samples. You can use an index such as rowSums(counts(dds, normalized=TRUE) >= 10) >= 5, or fill in a reasonable value instead of 5. For me, datasets on the order of 400 samples can be computed in less than an hour with DESeq() using e.g. 4 cores with parallel=TRUE. But also, for experiments with hundreds of replicates, I tend to use limma-voom, which benefits from a closed-form solution, while the GLM in DESeq2 requires iterative convergence.

Update: the function estimateSizeFactors can be used to estimate size factors for library size correction without running DESeq.
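Putting those pieces together, a minimal sketch might look like this (assuming dds.adj2 is the DESeqDataSet from the original post; the cutoffs of 10 normalized counts in at least 5 samples and the 4 workers are just example values to adjust):

library(DESeq2)
library(BiocParallel)

# size factors first, so the filter can use normalized counts
dds.adj2 <- estimateSizeFactors(dds.adj2)

# keep genes with normalized counts >= 10 in at least 5 samples
keep <- rowSums(counts(dds.adj2, normalized = TRUE) >= 10) >= 5
dds.adj2 <- dds.adj2[keep, ]

# run DESeq() across 4 workers (SnowParam works on Windows)
register(SnowParam(workers = 4))
dds.adj2 <- DESeq(dds.adj2, parallel = TRUE)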

ADD REPLY
0

Can you briefly elaborate on how the two methods, DESeq2 and limma-voom, differ?

ADD REPLY
1

Methods tend to have large overlap as the sample size grows large. See for example Schurch et al. 2016 or our DESeq2 paper. But limma-voom has a large speed advantage when you have 400+ samples, as here.
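In case it is useful, here is a rough limma-voom sketch with the same design (assuming hg38.counts and hg38.coremeta from the original post; the CPM filter cutoffs are just examples):

library(edgeR)
library(limma)

# build a DGEList from the raw counts and set up the same design
dge <- DGEList(counts = hg38.counts)
dge <- calcNormFactors(dge)
design <- model.matrix(~ agequart + muse_IDH1_status + seizure_history, data = hg38.coremeta)

# drop low-count genes, then voom + linear modeling
keep <- rowSums(cpm(dge) > 1) >= 5
dge <- dge[keep, , keep.lib.sizes = FALSE]
v <- voom(dge, design)
fit <- lmFit(v, design)
fit <- eBayes(fit)
topTable(fit, coef = ncol(design), number = 10)  # pick the coefficient of interest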

ADD REPLY
0

This doesn't seem to make sense. Could you clarify?

https://support.bioconductor.org/p/125781/

ADD REPLY
0

Does this happen while you run DESeq on the large data that was causing you trouble? If so, it would seem that swapping is not a problem, in which case I defer further solutions to Michael Love.

ADD REPLY
