Why does DEseq2 take a long time to run on large dataset ?
1
1
Entering edit mode
@mathiassaver-18308
Last seen 5.3 years ago

 I've been trying to use DEseq2 to test for differential gene expression on a dataset of RNAseq sample from control and schizophrenia cases.This dataset is roughly ~200 control + ~200 schizophrenia cases/samples. Even if only use "diagnosis" in the design formula ("~ diagnosis"), DEseq2 takes a really long time to run. So far I've used a 72h limit in our cluster computing system, and within that time DEseq2 does not run to completion.

Another detail is this: For some reason, I've found that for this particular data I have that for every gene there's at least one sample with a 0 read count. Upon researching a bit on google, i've found that I could use the parameter type = "iterate" in the estimateSizeFactors function. So that, if "ds" is my DEseqDataset object, the comands I'm using to run DEseq2 are:

ds = estimateSizeFactors(ds, type = "iterate")
dds1 = DESeq( ds, parallel = TRUE)

My questions are:

Would DEseq2 be expected to take this long for a dataset this large ?

How can i improve the time needed for DEseq2 to run?.

Is it "normal" /  "expected" that for every gene there's at least one sample with a read count of 0  for that gene ?

Any ideas/insights you might have that could help me improve this will be very well appreciated !

... and many thanks in advance !

Best,

-Mathias.

DEseq2 large dataset • 3.3k views
ADD COMMENT
0
Entering edit mode

One more comment: no, it's not normal that every sample would have a 0 for every gene in bulk RNA-seq. There are housekeeping genes that you'd expect to have positive counts for all samples. There may be failed samples that are being included and should be excluded based on QC. I recommend using MultiQC on FASTQC output.

ADD REPLY
1
Entering edit mode
@mikelove
Last seen 18 hours ago
United States

DESeq2 really shouldn’t take so long with just one variable. I have some posts here showing how long it takes for hundreds of samples, but you should easily get it under an hour with only a few cores (below I estimate ~15 minutes for 10,000 genes).

I recommend type=“poscounts” instead of “iterate”. The latter is more for method development but it’s certainly not fast.

ADD COMMENT
1
Entering edit mode

Also while you’re speeding things up, you should remove the genes that don’t have sufficient counts. You can use for such a large sample size:

keep <- rowSums(counts(dds) >= 10) >= n

Where you might choose n of 10 for such a large dataset. Then subset dds before running DESeq()

dds <- dds[keep,]

ADD REPLY
1
Entering edit mode

On my machine, 400 samples takes 1.5 minutes per 1000 genes with one core. So that extrapolates for 15 minutes for 10,000 genes (probably you should be around this number if you filter out the unexpressed genes).

> dds <- makeExampleDESeqDataSet(n=1000, m=400)
> system.time({ dds <- DESeq(dds, minRep=Inf) })
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
   user  system elapsed 
 87.690   0.044  87.747
ADD REPLY

Login before adding your answer.

Traffic: 696 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6