DESeq2 taking long time to run with 270 samples in 10 groups.
1
2
Entering edit mode
Steven Ge ▴ 40
@steven-ge-15003
Last seen 2.5 years ago
Sioux Falls, SD

How scalable is DESeq2 to larger datasets? 

I am running DESeq2 on a data set with 270 samples belonging to 10 sample groups. 

But it takes forever. 

colData = cbind(colnames(rawCounts), groups )

dds = DESeqDataSetFromMatrix(countData=rawCounts, colData=colData, design=~groups)

dds = DESeq(dds) 

Thanks!

 

deseq2 • 3.4k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 2 hours ago
United States

Here is some example timing in the code chunk at the bottom. Running DESeq() on 1000 genes for 300 samples across 10 conditions takes <2 minutes, and you can easily run this in parallel over multiple cores.

Another thing to help the speed: you can make sure you're not running DESeq() on genes with little power by pre-filtering on normalized counts. While this is not necessary for the statistical method, and strict pre-filtering isn't part of our vignette code, it can speed things up a lot because there are many genes with small counts.

dds <- estimateSizeFactors(dds)
# here filtering on moderate counts in 10 samples out of 300
nsamp <- 10
keep <- rowSums(counts(dds, normalized=TRUE) >= 10) >= nsamp
table(keep)
dds <- dds[keep,] 

Timing on a laptop using latest version of DESeq2 (1.18):

> library(DESeq)
> library(BiocParallel)
> dds <- makeExampleDESeqDataSet(n=1000,m=300); dds$condition <- factor(rep(1:10,each=30))
> system.time({ DESeq(dds, quiet=TRUE) })
   user  system elapsed 
105.664   0.020 105.678 
> system.time({ DESeq(dds, quiet=TRUE, parallel=TRUE, BPPARAM=MulticoreParam(2)) })
   user  system elapsed 
125.232   0.668  63.779
ADD COMMENT
2
Entering edit mode

Thank you, Mike!  This is very helpful. I have 25000 genes. It might take 50 minutes if it scales linearly with the number of genes. But the parallel options will help. 

I did some investigation on how DESeq2 scales with the number of samples, N. It shows that computing time scales perfectly with O(N^2).  See this plot

In  iDEP, a web application for analyzing RNA-seq data,  I have to make it default to limma-voom for larger data for now. Limma does not seem to have this problem. 

I like DESeq2 and use it a lot. For myself and for tools I develop.

Steven Ge

@StevenXGe

http://ge-lab.org/ 

 

ADD REPLY
0
Entering edit mode

Yes, I myself use limma-voom when I have hundreds of samples.

You probably don’t have 25,000 expressed genes but maybe more like 10,000. So roughly you can expect 20 minutes, or even less if you use e.g. four cores.

ADD REPLY
0
Entering edit mode

This is the code for the experiment: 

rawCounts = read.csv("count_Known_gene_RNAseq_mouse_embryo.csv")

rawCounts = rawCounts[!duplicated(rawCounts[,1]) , ]

rownames(rawCounts)= rawCounts[,1]

rawCounts = rawCounts[,-1]

rawCounts = round(rawCounts,0)

groups = gsub("\\..*","", colnames(rawCounts)   )

colData = cbind(colnames(rawCounts), groups )

library(DESeq2)

time =c() 

sizes = 10:1:10

for( n in sizes) { 

dds = DESeqDataSetFromMatrix(countData=rawCounts[,1:n],

                             colData=colData[1:n,],

                             design=~groups)

 

tem = system.time( DESeq(dds) )

cat(n, tem)

time=c(time, tem[1])

}

 

plot(sizes, time)

time2 = sqrt(time)

plot(  sizes, time2, xlab="Number of RNA-Seq Samples ", 

      ylab="Square Root of Time (seconds)", main="DESeq2 scalability" )

abline(lm( time2 ~ sizes))

 
ADD REPLY

Login before adding your answer.

Traffic: 697 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6