Question: Why does DEseq2 take a long time to run on large dataset ?
1
6 months ago by
mathias.saver10 wrote:

I've been trying to use DEseq2 to test for differential gene expression on a dataset of RNAseq sample from control and schizophrenia cases.This dataset is roughly ~200 control + ~200 schizophrenia cases/samples. Even if only use "diagnosis" in the design formula ("~ diagnosis"), DEseq2 takes a really long time to run. So far I've used a 72h limit in our cluster computing system, and within that time DEseq2 does not run to completion.

Another detail is this: For some reason, I've found that for this particular data I have that for every gene there's at least one sample with a 0 read count. Upon researching a bit on google, i've found that I could use the parameter type = "iterate" in the estimateSizeFactors function. So that, if "ds" is my DEseqDataset object, the comands I'm using to run DEseq2 are:

ds = estimateSizeFactors(ds, type = "iterate")
dds1 = DESeq( ds, parallel = TRUE)

My questions are:

Would DEseq2 be expected to take this long for a dataset this large ?

How can i improve the time needed for DEseq2 to run?.

Is it "normal" /  "expected" that for every gene there's at least one sample with a read count of 0  for that gene ?

Any ideas/insights you might have that could help me improve this will be very well appreciated !

... and many thanks in advance !

Best,

-Mathias.

deseq2 large dataset • 258 views
modified 6 months ago by Michael Love23k • written 6 months ago by mathias.saver10

One more comment: no, it's not normal that every sample would have a 0 for every gene in bulk RNA-seq. There are housekeeping genes that you'd expect to have positive counts for all samples. There may be failed samples that are being included and should be excluded based on QC. I recommend using MultiQC on FASTQC output.

Answer: Why does DEseq2 take a long time to run on large dataset ?
1
6 months ago by
Michael Love23k
United States
Michael Love23k wrote:

DESeq2 really shouldn’t take so long with just one variable. I have some posts here showing how long it takes for hundreds of samples, but you should easily get it under an hour with only a few cores (below I estimate ~15 minutes for 10,000 genes).

I recommend type=“poscounts” instead of “iterate”. The latter is more for method development but it’s certainly not fast.

ADD COMMENTlink modified 6 months ago • written 6 months ago by Michael Love23k
1

On my machine, 400 samples takes 1.5 minutes per 1000 genes with one core. So that extrapolates for 15 minutes for 10,000 genes (probably you should be around this number if you filter out the unexpressed genes).

> dds <- makeExampleDESeqDataSet(n=1000, m=400)
> system.time({ dds <- DESeq(dds, minRep=Inf) })
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
user  system elapsed
87.690   0.044  87.747

Also while you’re speeding things up, you should remove the genes that don’t have sufficient counts. You can use for such a large sample size:

keep <- rowSums(counts(dds) >= 10) >= n

Where you might choose n of 10 for such a large dataset. Then subset dds before running DESeq()

dds <- dds[keep,]