I want to perform differential expression analysis on a data set containing 17,000 samples. The salmon quant.sf files are about 1.5 TB.
Based on my naive understanding of R and R packages, I believe I will need to run on a single very large machine; that is to say, I cannot take advantage of a cluster of machines.
I read the section in the vignette on 'Using parallelization'.
Is there a rule of thumb for machine sizing?
I plan to run my analysis in either AWS or GCP so I should be able to access a very large machine.
Can you recommend a Docker image?
Any suggestions for how much SSD, memory, swap, CPU, etc. I should use, and what the run time is likely to be?
Should I consider porting a bare-bones version to something like Apache Spark so I can throw a lot of machines at the problem?
Kind regards
Andy
Running DESeq with 1000 samples
If you ask me, memory is the only factor here; with limma-voom everything is single-threaded. You can simulate this by running a dummy matrix with 10, 20, ..., 100, 200, ..., 500 samples on a local machine using limma-voom and collecting memory statistics. That should give you an estimate. A Docker image might make sense, as you do not need to install anything; or simply install the required packages with conda. It is just a DE analysis after all. As long as you have enough memory you will be fine, and once you have the results table you can go back to any standard laptop for downstream analysis.
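To make the dummy-matrix benchmark concrete, a minimal R sketch might look like the following. The gene count, the negative-binomial simulation parameters, and the use of the peakRAM package for memory statistics are my own choices for illustration, not something prescribed in this thread; any other way of recording peak memory would do.

```r
# Sketch: estimate limma-voom resource usage on simulated count matrices
# of increasing sample size. Assumes limma, edgeR, and peakRAM are installed.
library(limma)
library(edgeR)
library(peakRAM)

benchmark_voom <- function(n_samples, n_genes = 20000) {
  # Simulated negative-binomial counts as a stand-in for real bulk data
  counts <- matrix(rnbinom(n_genes * n_samples, mu = 100, size = 1),
                   nrow = n_genes)
  group  <- factor(rep(c("A", "B"), length.out = n_samples))
  design <- model.matrix(~ group)
  # peakRAM() returns elapsed time and peak RAM used by the expression
  peakRAM({
    v   <- voom(DGEList(counts), design)
    fit <- eBayes(lmFit(v, design))
  })
}

# Collect statistics over a range of sample sizes and extrapolate
for (n in c(10, 50, 100, 500)) print(benchmark_voom(n))
```

Plotting peak memory against sample size from a run like this should make the scaling roughly linear and easy to extrapolate toward 17,000 samples.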
Hi Michael
These are bulk samples.
I like your idea of running a couple of subsets to get a rough idea of the required resources. I am not sure, though, why you would want to use limma-voom for the resource estimation instead of DESeq2?
thanks
Andy
DESeq2 with many samples
I often recommend and use limma-voom for large bulk datasets; it is much faster than GLM-based methods.
Any recommendations for normalization, variance stabilization, or other additional steps?
What exactly is the problem? Both normalization and VST scale well with larger datasets. Each should take a minute or less, even for hundreds of samples.
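As a concrete illustration of those two steps in DESeq2, a minimal sketch on simulated counts could look like this (the matrix dimensions, simulation parameters, and two-group design are made up for the example):

```r
# Sketch: size-factor normalization and variance-stabilizing transformation
# in DESeq2 on a simulated bulk count matrix. Assumes DESeq2 is installed.
library(DESeq2)

counts  <- matrix(rnbinom(20000 * 200, mu = 100, size = 1), nrow = 20000)
coldata <- data.frame(condition = factor(rep(c("A", "B"), each = 100)))

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)

dds <- estimateSizeFactors(dds)   # median-of-ratios normalization
vsd <- vst(dds, blind = FALSE)    # fits the dispersion trend on a gene subset,
                                  # which keeps it fast on large matrices
```

`vst()` is the fast variant intended for large datasets; the transformed values in `assay(vsd)` can then go into PCA, clustering, or other downstream exploration.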
I was asking in terms of limma-voom. But once I actually read about what limma-voom does, I found that it applies a different kind of transformation to make the count data approximately normal. A terrible idea for small count datasets, but probably decent for large ones. I have on the order of 10^4 genes and 10^4 samples. I may just give DESeq2 a shot anyway, as I can spin up a pretty hefty machine on GCP.