DESeq2 machine sizing best practices for a very large data set
aedavids • 0

I want to perform differential expression analysis on a data set containing 17,000 samples. The Salmon quant.sf files total about 1.5 TB.

Based on my naive understanding of R and R packages, I believe I will need to run on a single very large machine; that is to say, I cannot take advantage of a cluster of machines.

I read the section in the vignette on 'Using parallelization'.

Is there a rule of thumb for machine sizing?

I plan to run my analysis in either AWS or GCP so I should be able to access a very large machine.

Can you recommend a Docker image?

Any suggestions for how much SSD, memory, swap, CPU, ... I should use, and what the run time is likely to be?

Should I consider porting a bare-bones version to something like Apache Spark so I can throw a lot of machines at the problem?

Kind regards

Andy


Running DESeq with 1000 samples

In my opinion, memory is the only limiting factor here, since limma-voom runs single-threaded. You can estimate the requirement by running a dummy matrix with an increasing number of samples (10, 20, ... 100, 200, ... 500) on a local machine using limma-voom and collecting memory statistics. That should give you an estimate. A Docker image might make sense because you then do not need to install anything; alternatively, simply install the required packages with conda. It is just a DE analysis after all: as long as you have enough memory you will be fine, and once you have the results table you can go back to any standard laptop for downstream analysis.
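The extrapolation step could be sketched roughly as follows. This is a minimal Python sketch, not something from the thread: it assumes peak memory grows approximately linearly with the number of samples for a fixed number of features, which you should check against your own benchmark runs. The function names are made up for illustration, and any feature count or benchmark figure you pass in has to come from your own data.

```python
# Sketch: estimate memory for the full data set from small benchmark runs.
# Assumption (not from the thread): peak memory is roughly linear in the
# number of samples when the number of features is held fixed.

def dense_matrix_gib(n_features, n_samples, bytes_per_value=8):
    """Size in GiB of one dense numeric matrix (an R double is 8 bytes).

    This is a lower bound only: an analysis typically holds several
    matrices of this shape in memory at the same time.
    """
    return n_features * n_samples * bytes_per_value / 2**30


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def extrapolate_peak_gib(sample_counts, peak_gib, target_samples):
    """Extrapolate measured peak memory (GiB) to a larger sample count."""
    intercept, slope = fit_line(sample_counts, peak_gib)
    return intercept + slope * target_samples
```

For example, `dense_matrix_gib(60_000, 17_000)` gives roughly 7.6 GiB for a single matrix of doubles, where 60,000 features is an assumed figure and not one stated in the question. To extrapolate, pass the sample counts and peak memory readings from your own subset runs to `extrapolate_peak_gib(..., target_samples=17_000)`, and treat the result as a rough guide rather than a guarantee.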


Hi Michael

These are bulk samples.

I like your idea of running a couple of subsets to get a rough idea of the required resources. Why would you use limma-voom for the resource estimation instead of DESeq2?

Thanks

Andy


I often recommend and use limma-voom for large bulk datasets; it is much faster than GLM-based methods.

@mikelove

Is this bulk or single cell?
