Question: Why does DESeq2 take a long time to run on a large dataset?
mathias.saver wrote, 27 days ago:

I've been trying to use DESeq2 to test for differential gene expression on a dataset of RNA-seq samples from control and schizophrenia cases. The dataset has roughly ~200 control + ~200 schizophrenia samples. Even if I use only "diagnosis" in the design formula ("~ diagnosis"), DESeq2 takes a really long time to run. So far I've used a 72 h limit on our cluster computing system, and within that time DESeq2 does not run to completion.

Another detail: for some reason, in this particular dataset every gene has at least one sample with a read count of 0. After researching a bit on Google, I found that I could use the argument type = "iterate" in the estimateSizeFactors function. So, if "ds" is my DESeqDataSet object, the commands I'm using to run DESeq2 are:

ds = estimateSizeFactors(ds, type = "iterate")
dds1 = DESeq( ds, parallel = TRUE)

My questions are:

Would DESeq2 be expected to take this long for a dataset this large?

How can I improve the time needed for DESeq2 to run?

Is it "normal" /  "expected" that for every gene there's at least one sample with a read count of 0  for that gene ?

Any ideas/insights you might have that could help me improve this would be very much appreciated!

... and many thanks in advance!

Best,

-Mathias.


Michael Love commented, 27 days ago:

One more comment: no, it's not normal in bulk RNA-seq that every gene would have a zero count in at least one sample. There are housekeeping genes that you'd expect to have positive counts in all samples. There may be failed samples being included that should be excluded based on QC. I recommend running MultiQC on the FastQC output.

Michael Love answered, 27 days ago:

DESeq2 really shouldn’t take so long with just one variable. I have some posts here showing how long it takes for hundreds of samples, but you should easily get it under an hour with only a few cores (below I estimate ~15 minutes for 10,000 genes).

I recommend type=“poscounts” instead of “iterate”. The latter is more for method development but it’s certainly not fast.
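
In code, that might look like the following minimal sketch. It assumes "ds" is the DESeqDataSet from the question; the multicore backend and the choice of 4 workers are just examples of how parallel = TRUE could be spread over a few cores:

library(DESeq2)
library(BiocParallel)

# register a backend so DESeq(..., parallel = TRUE) can use several cores
register(MulticoreParam(workers = 4))

# "poscounts" handles genes with zero counts in some samples
# without the slow iterative size-factor estimation
ds <- estimateSizeFactors(ds, type = "poscounts")
dds1 <- DESeq(ds, parallel = TRUE)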

On my machine, 400 samples take 1.5 minutes per 1,000 genes with one core. So that extrapolates to about 15 minutes for 10,000 genes (you should probably be around this number if you filter out the unexpressed genes).

> dds <- makeExampleDESeqDataSet(n=1000, m=400)
> system.time({ dds <- DESeq(dds, minRep=Inf) })
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
   user  system elapsed 
 87.690   0.044  87.747

Also, while you're speeding things up, you should remove the genes that don't have sufficient counts. For such a large sample size you can use:

keep <- rowSums(counts(dds) >= 10) >= n

where you might choose n = 10 for such a large dataset. Then subset dds before running DESeq():

dds <- dds[keep,]
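
Pulling the pieces of this thread together, a rough end-to-end sketch, with the packages loaded and a parallel backend registered as in the earlier sketch. The n = 10 threshold is just the example value above, and "ds" is assumed to be the ~400-sample DESeqDataSet from the question:

# keep genes with a count of at least 10 in at least n samples
n <- 10
keep <- rowSums(counts(ds) >= 10) >= n
ds <- ds[keep, ]

# size factors robust to zeros, then the standard DESeq2 fit in parallel
ds <- estimateSizeFactors(ds, type = "poscounts")
dds1 <- DESeq(ds, parallel = TRUE)
res <- results(dds1)   # diagnosis: last level vs reference level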
