Question

Why does DEseq2 take a long time to run on large dataset ?

1

Entering edit mode

mathias.saver ▴ 10

@mathiassaver-18308

Last seen 5.7 years ago

I've been trying to use DEseq2 to test for differential gene expression on a dataset of RNAseq sample from control and schizophrenia cases.This dataset is roughly ~200 control + ~200 schizophrenia cases/samples. Even if only use "diagnosis" in the design formula ("~ diagnosis"), DEseq2 takes a really long time to run. So far I've used a 72h limit in our cluster computing system, and within that time DEseq2 does not run to completion.

Another detail is this: For some reason, I've found that for this particular data I have that for every gene there's at least one sample with a 0 read count. Upon researching a bit on google, i've found that I could use the parameter type = "iterate" in the estimateSizeFactors function. So that, if "ds" is my DEseqDataset object, the comands I'm using to run DEseq2 are:

ds = estimateSizeFactors(ds, type = "iterate")
dds1 = DESeq( ds, parallel = TRUE)

My questions are:

Would DEseq2 be expected to take this long for a dataset this large ?

How can i improve the time needed for DEseq2 to run?.

Is it "normal" / "expected" that for every gene there's at least one sample with a read count of 0 for that gene ?

Any ideas/insights you might have that could help me improve this will be very well appreciated !

... and many thanks in advance !

Best,

-Mathias.

DEseq2 large dataset • 3.7k views

ADD COMMENT • link updated 5.7 years ago by Michael Love 42k • written 5.7 years ago by mathias.saver ▴ 10

0

Entering edit mode

One more comment: no, it's not normal that every sample would have a 0 for every gene in bulk RNA-seq. There are housekeeping genes that you'd expect to have positive counts for all samples. There may be failed samples that are being included and should be excluded based on QC. I recommend using MultiQC on FASTQC output.

ADD REPLY • link 5.7 years ago Michael Love 42k

score 1 · Answer 1 · 2018-11-14

1

Entering edit mode

Michael Love 42k

@mikelove

Last seen 13 hours ago

United States

DESeq2 really shouldn’t take so long with just one variable. I have some posts here showing how long it takes for hundreds of samples, but you should easily get it under an hour with only a few cores (below I estimate ~15 minutes for 10,000 genes).

I recommend type=“poscounts” instead of “iterate”. The latter is more for method development but it’s certainly not fast.

ADD COMMENT • link 5.7 years ago Michael Love 42k

1

Entering edit mode

Also while you’re speeding things up, you should remove the genes that don’t have sufficient counts. You can use for such a large sample size:

keep <- rowSums(counts(dds) >= 10) >= n

Where you might choose n of 10 for such a large dataset. Then subset dds before running DESeq()

dds <- dds[keep,]

ADD REPLY • link 5.7 years ago Michael Love 42k

1

Entering edit mode

On my machine, 400 samples takes 1.5 minutes per 1000 genes with one core. So that extrapolates for 15 minutes for 10,000 genes (probably you should be around this number if you filter out the unexpressed genes).

> dds <- makeExampleDESeqDataSet(n=1000, m=400)
> system.time({ dds <- DESeq(dds, minRep=Inf) })
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
   user  system elapsed 
 87.690   0.044  87.747

ADD REPLY • link 5.7 years ago Michael Love 42k