I want to perform differential expression analysis on a data set containing 17,000 samples. The Salmon quant.sf files total about 1.5 TB.
Based on my naive understanding of R and R packages, I believe I will need to run on a single very large machine; that is to say, I cannot take advantage of a cluster of machines.
I read the section in the vignette on 'Using parallelization'.
Is there a rule of thumb for machine sizing?
I plan to run my analysis on either AWS or GCP, so I should be able to access a very large machine.
Can you recommend a Docker image?
Any suggestions for how much SSD, memory, swap, CPU, etc. I should use, and what the run time is likely to be?
Should I consider porting a bare-bones version to something like Apache Spark so I can throw a lot of machines at the problem?