DESeq2: results() taking too long (> 3 hrs)
Nidhi

I ran DESeq() on 8 samples, which took about 30-45 minutes. But results() has now been running for more than 3 hours and is still going. I can hear my laptop fan buzzing.

dds_cc1 <- DESeq(dds_CC)

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

res.cc <- results(dds_cc1, contrast = c("sex", "male", "female"))

sessionInfo()

Thoughts?


It ran faster when I filtered the table based on read counts. So it works; it's just slow with larger datasets.
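(For context, a read-count prefilter of the kind described here is sketched below; the object name dds_CC is taken from the question above, and the cutoff of 10 reads is an assumption for illustration, not the poster's actual code.)

library(DESeq2)
# keep only genes with at least 10 reads summed across samples before running DESeq()
keep <- rowSums(counts(dds_CC)) >= 10
dds_CC <- dds_CC[keep, ]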

@mikelove

Something is off. It should take a few seconds, and the runtime should be roughly linear in the number of genes.

> dds <- makeExampleDESeqDataSet(n=1e4, m=8)
> system.time({ DESeq(dds) })
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
   user  system elapsed
  1.362   0.030   1.402
> dds <- DESeq(dds, quiet=TRUE)
> system.time({ results(dds) })
   user  system elapsed
  0.077   0.005   0.083

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.0.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
 [1] DESeq2_1.38.1               SummarizedExperiment_1.28.0 Biobase_2.58.0
 [4] MatrixGenerics_1.10.0       matrixStats_0.63.0          GenomicRanges_1.50.1
 [7] GenomeInfoDb_1.34.4         IRanges_2.32.0              S4Vectors_0.36.0
[10] BiocGenerics_0.44.0

Hello! I'm adding to this thread as I'm having the same issue; I hope that's OK! My results() call looks like this:

WBL.DGE.results <- results(WBL.dds,
                           independentFiltering = TRUE,
                           alpha = 0.05,
                           contrast = c("Tissue.Type", "MSYM.WB", "FSYM.WB"),
                           lfcThreshold = 1,  # log2 fold change of 1, i.e. an actual fold change of 2
                           altHypothesis = "greaterAbs")

I'm also seeing some batch effects (even after using limma's removeBatchEffect(), so I'm running without batch for comparison), and I find that if I include batch in the design matrix it takes less time.

I also filtered low-count genes out of the dds prior to running DESeq(), as mentioned above.
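(A minimal sketch of what including batch in the design, as mentioned above, typically looks like; the column name Batch is an assumption, while Tissue.Type comes from the results() call above.)

# put the adjustment variable first and the condition of interest last in the design
design(WBL.dds) <- ~ Batch + Tissue.Type
WBL.dds <- DESeq(WBL.dds)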


It really shouldn't take more than a second.

Can you run my code above?


Sure, here is the code:

> dds <- makeExampleDESeqDataSet(n=1e4, m=8)
> system.time({ DESeq(dds) })
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
   user  system elapsed 
  6.267   0.321   6.857 


> dds <- DESeq(dds, quiet=TRUE)
> system.time({ results(dds) })
   user  system elapsed 
  0.415   0.044   0.493 

sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.7.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] RColorBrewer_1.1-3          pheatmap_1.0.12            
 [3] ggrepel_0.9.3               rhdf5_2.40.0               
 [5] tximport_1.24.0             DESeq2_1.36.0              
 [7] SummarizedExperiment_1.26.1 Biobase_2.56.0             
 [9] MatrixGenerics_1.8.1        matrixStats_0.63.0         
[11] GenomicRanges_1.48.0        GenomeInfoDb_1.32.4        
[13] IRanges_2.30.1              S4Vectors_0.34.0           
[15] BiocGenerics_0.42.0         forcats_1.0.0              
[17] stringr_1.5.0               purrr_1.0.1                
[19] readr_2.1.4                 tidyr_1.3.0                
[21] tibble_3.1.8                ggplot2_3.4.1              
[23] tidyverse_1.3.2             dplyr_1.1.0                

loaded via a namespace (and not attached):
 [1] bitops_1.0-7           fs_1.6.1               lubridate_1.9.2       
 [4] bit64_4.0.5            httr_1.4.4             tools_4.2.2           
 [7] backports_1.4.1        utf8_1.2.3             R6_2.5.1              
[10] DBI_1.1.3              colorspace_2.1-0       rhdf5filters_1.8.0    
[13] withr_2.5.0            tidyselect_1.2.0       bit_4.0.5             
[16] compiler_4.2.2         cli_3.6.0              rvest_1.0.3           
[19] xml2_1.3.3             DelayedArray_0.22.0    scales_1.2.1          
[22] genefilter_1.78.0      XVector_0.36.0         pkgconfig_2.0.3       
[25] dbplyr_2.3.0           fastmap_1.1.0          limma_3.52.4          
[28] rlang_1.0.6            readxl_1.4.2           rstudioapi_0.14       
[31] RSQLite_2.3.0          generics_0.1.3         jsonlite_1.8.4        
[34] BiocParallel_1.30.4    googlesheets4_1.0.1    RCurl_1.98-1.10       
[37] magrittr_2.0.3         GenomeInfoDbData_1.2.8 Matrix_1.5-1          
[40] Rhdf5lib_1.18.2        Rcpp_1.0.10            munsell_0.5.0         
[43] fansi_1.0.4            lifecycle_1.0.3        stringi_1.7.12        
[46] zlibbioc_1.42.0        grid_4.2.2             blob_1.2.3            
[49] parallel_4.2.2         crayon_1.5.2           lattice_0.20-45       
[52] Biostrings_2.64.1      haven_2.5.1            splines_4.2.2         
[55] annotate_1.74.0        hms_1.1.2              KEGGREST_1.36.3       
[58] locfit_1.5-9.7         pillar_1.8.1           geneplotter_1.74.0    
[61] codetools_0.2-18       reprex_2.0.2           XML_3.99-0.13         
[64] glue_1.6.2             modelr_0.1.10          vctrs_0.5.2           
[67] png_0.1-8              tzdb_0.3.0             cellranger_1.1.0      
[70] gtable_0.3.1           assertthat_0.2.1       cachem_1.0.6          
[73] xtable_1.8-4           broom_1.0.3            survival_3.4-0        
[76] googledrive_2.0.0      gargle_1.3.0           AnnotationDbi_1.58.0  
[79] memoise_2.0.1          timechange_0.2.0       ellipsis_0.3.2  

This ran really fast. Maybe it has something to do with the other loaded packages/libraries? I noticed I have way more here than your output showed.


Well, you can try to figure out the difference between the two runs, e.g. you can add genes and samples. Not knowing anything else, I'll just mention that, except with a very large number of parameters (e.g. sometimes people try to fit models with 50 coefficients), DESeq() usually runs in less than a minute.
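(A sketch of that kind of comparison, scaling up the example data used earlier in this thread; the gene counts below are arbitrary choices for illustration.)

library(DESeq2)
# time DESeq() plus results() at increasing gene counts to see where the slowdown starts
for (n in c(1e4, 1e5, 5e5)) {
  dds <- makeExampleDESeqDataSet(n = n, m = 8)
  print(system.time({
    dds <- DESeq(dds, quiet = TRUE)
    res <- results(dds)
  }))
}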


I see, thanks for your input! At most I'm working with 3 coefficients, but there's a large number of genes making up the dds (~709,000 elements if I include batch, and ~645,000 without batch). I'm going to try clustering my transcripts before generating a gene list, to cut down on the noise, and see if that helps.


The slowdown is likely due to the memory limits of your machine. A million rows is a lot to churn through, and you are probably having to fall back on virtual memory to deal with the RAM shortage.

I'd definitely recommend more filtering, as that is far more than the number of transcripts actually expressed in eukaryotic cells, which is probably closer to 1e4.

How fast is it if you run it on the top expressed transcripts?

E.g.

dds2 <- dds[ order(rowSums(counts(dds)), decreasing=TRUE)[1:1000], ]
# then do DESeq() and results()

Right, I tried upping the RAM given to R via .Renviron on the command line on my Mac to 2.5 GB, but alas, that wasn't enough. Here's the code:

> system.time({
+   WBL.dds2 <- WBL.dds[ order(rowSums(counts(WBL.dds)), decreasing=TRUE)[1:1000], ]
+   WBL.dds2 <- DESeq(WBL.dds2)
+   WBL.DGE.results2 <- results(WBL.dds2,
+                              independentFiltering = TRUE,
+                              alpha = 0.05,
+                              contrast=c("Tissue.Type","MSYM.WB","FSYM.WB"),
+                              lfcThreshold = 1, # log2 fold change of 1, i.e. an actual fold change of 2
+                              altHypothesis="greaterAbs"
+   )
+   })

using pre-existing size factors
estimating dispersions
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
-- note: fitType='parametric', but the dispersion trend was not well captured by the
   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.
final dispersion estimates
fitting model and testing
   user  system elapsed 
  2.654   0.094   2.813 

This is much faster! Would this be losing any DE information, though, since it only looks at the top expressed genes across samples?


Yes, this was just to demonstrate timing and that the issue is likely a lack of available memory. I'd recommend picking in the neighborhood of ~20k or so features. You can go deeper into what constitutes expression and think about how far down the list to go, but do think about how many transcripts are actually expressed.

A common filter is:

keep <- rowSums(counts(dds) >= 10) >= X
dds <- dds[keep,]

Where X is some number of samples, maybe half your dataset.


I see. This is straight from the vignette; I'm bummed I missed it! I ran a prefilter of:

WBL.keep <- rowSums(counts(WBL.dds)) >= 10
WBL.dds <- WBL.dds[WBL.keep,]

But I totally missed the last >= X part. Similar to the OP's design, I'm working with the same organ between sexes and want to characterize differences there. At most I'm working with 15 samples, but usually 10, so I'll go with 5 for the X in the prefilter.

As for the ~20k elements, would this go here?

  WBL.dds2 <- WBL.dds[ order(rowSums(counts(WBL.dds)), decreasing=TRUE)[1:20000], ]

You can skip the top-20k filter and just use the count filter above.
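(Putting that together, a sketch of the recommended prefilter, assuming the ~10-sample WBL.dds object and the contrast described earlier in this thread:)

# require at least 10 reads in at least 5 of the ~10 samples, then re-run
WBL.keep <- rowSums(counts(WBL.dds) >= 10) >= 5
WBL.dds <- WBL.dds[WBL.keep, ]
WBL.dds <- DESeq(WBL.dds)
WBL.DGE.results <- results(WBL.dds, contrast = c("Tissue.Type", "MSYM.WB", "FSYM.WB"))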


Great, thank you so much, Michael, for all your help to me and to the countless other folks you've helped, and for creating a really useful tool! Sincerely, John

PS: I may be posting soon about the batch effects I still see in my data!
