I am doing DE analysis on a massive dataset (the count matrix is ~300 samples x ~200,000 genes/count features). I have access to a cluster computer and want to use edgeR. Can I split this up into multiple runs with batches of genes (i.e., 20 runs of ~10,000 genes each), or will this distort my dispersion estimates?
Well, the dispersion estimates will definitely be different from an analysis using all genes together. The question is whether or not this matters. If you subset the data into multiple batches with >10000 genes each, you should still have enough information per run to get a stable estimate of the common/trended dispersion (depending on what flavour of edgeR you're using). So, it should be valid to apply edgeR within each subset.
The key here is to avoid subsetting in silly ways that affect the amount of information in each subset. For example, don't put most of the high-abundance genes in one subset and most of the low-abundance genes in another, because then you won't have enough information to stably estimate the mean-dispersion trend in both subsets. I'd recommend randomly sampling genes into the subsets to avoid any selection biases, along the lines of the sketch below.
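For illustration, a minimal sketch of random batching, assuming counts is the full count matrix and design is the design matrix; the number of batches and the use of estimateDisp() are my own choices, not prescribed:

library(edgeR)

set.seed(100)  # reproducible random assignment of genes to batches
nbatches <- 20
batch <- sample(rep(seq_len(nbatches), length.out = nrow(counts)))

results <- vector("list", nbatches)
for (b in seq_len(nbatches)) {
    y <- DGEList(counts[batch == b, , drop = FALSE])
    y <- calcNormFactors(y)       # ideally, compute normalization on the full matrix and carry it over
    y <- estimateDisp(y, design)  # common, trended and tagwise dispersions in one call
    fit <- glmFit(y, design)
    results[[b]] <- glmLRT(fit)
}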
The simpler alternative is to just allocate more memory on the cluster (maybe around 20-30 GB?), use all the genes together and let edgeR run for an hour or so. Or overnight, depending on how it goes.
Edit: Perhaps an obvious point, but check whether filtering brings down the total number of features. If 90% of the features have average counts close to zero, you might as well toss them out because they won't be particularly useful for DE testing anyway.
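For example, a quick check along these lines, assuming y is your DGEList; the CPM cutoff and minimum number of samples here are illustrative, not prescriptive:

keep <- rowSums(cpm(y) > 1) >= 3  # expressed at ~1 CPM in at least 3 samples
summary(keep)                     # how many features would survive
y <- y[keep, , keep.lib.sizes = FALSE]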
I did a similar run with 124 samples x 100,000 genes, and it took approximately 8 hours:
> #edgeR independent samples
> y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
Disp = 3.86131 , BCV = 1.965
Time difference of 30.43362 mins
> y <- estimateGLMTrendedDisp(y, design)
Loading required package: splines
Time difference of 4.946792 hours
> y <- estimateGLMTagwiseDisp(y, design)
> fit <- glmFit(y, design)
> lrt <- glmLRT(fit)
Time difference of 2.558655 hours
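For reference, and assuming a reasonably recent edgeR, the three estimate* steps above can be collapsed into a single estimateDisp() call; whether that is any faster on a matrix of this size is not something I have tested:

y <- estimateDisp(y, design)  # common, trended and tagwise dispersions in one pass
fit <- glmFit(y, design)
lrt <- glmLRT(fit)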
Right now, estimateGLMCommonDisp(y, design, verbose=TRUE) on the larger set has been running for ~12 hours and is still not done (I gave it 16 GB). Does allocating more memory help?
EDIT: I can probably filter out more genes. This is the ASE analysis that you helped me with so brilliantly last week, so there are a lot of zeros and low counts. Despite aggressive filtering, ~200,000 features remain.