Question

DESeq2 taking long time to run with 270 samples in 10 groups.

2

Entering edit mode

Steven Ge ▴ 40

@steven-ge-15003

Last seen 3.5 years ago

Sioux Falls, SD

How scalable is DESeq2 to larger datasets?

I am running DESeq2 on a data set with 270 samples belonging to 10 sample groups.

But it takes forever.

colData = cbind(colnames(rawCounts), groups )

dds = DESeqDataSetFromMatrix(countData=rawCounts, colData=colData, design=~groups)

dds = DESeq(dds)

Thanks!

deseq2 • 4.1k views

ADD COMMENT • link updated 7.1 years ago by Michael Love 43k • written 7.1 years ago by Steven Ge ▴ 40

score 1 · Answer 1 · 2018-02-11

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 9 days ago

United States

Here is some example timing in the code chunk at the bottom. Running DESeq() on 1000 genes for 300 samples across 10 conditions takes <2 minutes, and you can easily run this in parallel over multiple cores.

Another thing to help the speed: you can make sure you're not running DESeq() on genes with little power by pre-filtering on normalized counts. While this is not necessary for the statistical method, and strict pre-filtering isn't part of our vignette code, it can speed things up a lot because there are many genes with small counts.

dds <- estimateSizeFactors(dds)
# here filtering on moderate counts in 10 samples out of 300
nsamp <- 10
keep <- rowSums(counts(dds, normalized=TRUE) >= 10) >= nsamp
table(keep)
dds <- dds[keep,]

Timing on a laptop using latest version of DESeq2 (1.18):

> library(DESeq)
> library(BiocParallel)
> dds <- makeExampleDESeqDataSet(n=1000,m=300); dds$condition <- factor(rep(1:10,each=30))
> system.time({ DESeq(dds, quiet=TRUE) })
   user  system elapsed 
105.664   0.020 105.678 
> system.time({ DESeq(dds, quiet=TRUE, parallel=TRUE, BPPARAM=MulticoreParam(2)) })
   user  system elapsed 
125.232   0.668  63.779

ADD COMMENT • link 7.1 years ago Michael Love 43k

2

Entering edit mode

Thank you, Mike! This is very helpful. I have 25000 genes. It might take 50 minutes if it scales linearly with the number of genes. But the parallel options will help.

I did some investigation on how DESeq2 scales with the number of samples, N. It shows that computing time scales perfectly with O(N^2). See this plot.

In iDEP, a web application for analyzing RNA-seq data, I have to make it default to limma-voom for larger data for now. Limma does not seem to have this problem.

I like DESeq2 and use it a lot. For myself and for tools I develop.

Steven Ge

@StevenXGe

http://ge-lab.org/

ADD REPLY • link 7.1 years ago Steven Ge ▴ 40

0

Entering edit mode

Yes, I myself use limma-voom when I have hundreds of samples.

You probably don’t have 25,000 expressed genes but maybe more like 10,000. So roughly you can expect 20 minutes, or even less if you use e.g. four cores.

ADD REPLY • link 7.1 years ago Michael Love 43k

0

Entering edit mode

This is the code for the experiment:

rawCounts = read.csv("count_Known_gene_RNAseq_mouse_embryo.csv")

rawCounts = rawCounts[!duplicated(rawCounts[,1]) , ]

rownames(rawCounts)= rawCounts[,1]

rawCounts = rawCounts[,-1]

rawCounts = round(rawCounts,0)

groups = gsub("\..*","", colnames(rawCounts) )

colData = cbind(colnames(rawCounts), groups )

library(DESeq2)

time =c()

sizes = 10:1:10

for( n in sizes) {

dds = DESeqDataSetFromMatrix(countData=rawCounts[,1:n],

colData=colData[1:n,],

design=~groups)

tem = system.time( DESeq(dds) )

cat(n, tem)

time=c(time, tem[1])

}

plot(sizes, time)

time2 = sqrt(time)

plot( sizes, time2, xlab="Number of RNA-Seq Samples ",

ylab="Square Root of Time (seconds)", main="DESeq2 scalability" )

abline(lm( time2 ~ sizes))

ADD REPLY • link 7.1 years ago Steven Ge ▴ 40