Here is some example timing in the code chunk at the bottom. Running DESeq() on 1000 genes for 300 samples across 10 conditions takes <2 minutes, and you can easily run this in parallel over multiple cores.
Another thing to help the speed: you can make sure you're not running DESeq() on genes with little power by pre-filtering on normalized counts. While this is not necessary for the statistical method, and strict pre-filtering isn't part of our vignette code, it can speed things up a lot because there are many genes with small counts.
dds <- estimateSizeFactors(dds)
# here filtering on moderate counts in 10 samples out of 300
nsamp <- 10
keep <- rowSums(counts(dds, normalized=TRUE) >= 10) >= nsamp
table(keep)
dds <- dds[keep,]
Timing on a laptop using latest version of DESeq2 (1.18):
> library(DESeq)
> library(BiocParallel)
> dds <- makeExampleDESeqDataSet(n=1000,m=300); dds$condition <- factor(rep(1:10,each=30))
> system.time({ DESeq(dds, quiet=TRUE) })
user system elapsed
105.664 0.020 105.678
> system.time({ DESeq(dds, quiet=TRUE, parallel=TRUE, BPPARAM=MulticoreParam(2)) })
user system elapsed
125.232 0.668 63.779
Thank you, Mike! This is very helpful. I have 25000 genes. It might take 50 minutes if it scales linearly with the number of genes. But the parallel options will help.
I did some investigation on how DESeq2 scales with the number of samples, N. It shows that computing time scales perfectly with O(N^2). See this plot.
In iDEP, a web application for analyzing RNA-seq data, I have to make it default to limma-voom for larger data for now. Limma does not seem to have this problem.
I like DESeq2 and use it a lot. For myself and for tools I develop.
Steven Ge
@StevenXGe
http://ge-lab.org/
Yes, I myself use limma-voom when I have hundreds of samples.
You probably don’t have 25,000 expressed genes but maybe more like 10,000. So roughly you can expect 20 minutes, or even less if you use e.g. four cores.
This is the code for the experiment:
rawCounts = read.csv("count_Known_gene_RNAseq_mouse_embryo.csv")
rawCounts = rawCounts[!duplicated(rawCounts[,1]) , ]
rownames(rawCounts)= rawCounts[,1]
rawCounts = rawCounts[,-1]
rawCounts = round(rawCounts,0)
groups = gsub("\..*","", colnames(rawCounts) )
colData = cbind(colnames(rawCounts), groups )
library(DESeq2)
time =c()
sizes = 10:1:10
for( n in sizes) {
dds = DESeqDataSetFromMatrix(countData=rawCounts[,1:n],
colData=colData[1:n,],
design=~groups)
tem = system.time( DESeq(dds) )
cat(n, tem)
time=c(time, tem[1])
}
plot(sizes, time)
time2 = sqrt(time)
plot( sizes, time2, xlab="Number of RNA-Seq Samples ",
ylab="Square Root of Time (seconds)", main="DESeq2 scalability" )
abline(lm( time2 ~ sizes))