edgeR on extremely large dataset
martiningi ▴ 40
@martiningi-9274
Last seen 6.0 years ago
United States

I am doing differential expression (DE) analysis on a massive dataset (the count matrix is ~300 samples × 200,000 genes/count features). I have access to a cluster computer and want to use edgeR. Can I split this into multiple runs with batches of genes (i.e., 20 runs of ~15,000 genes each), or will this distort my dispersion estimates?

edgeR rnaseq
Aaron Lun ★ 27k
@alun
Last seen 8 hours ago
The city by the bay

Well, the dispersion estimates will definitely be different from an analysis using all genes together. The question is whether or not this matters. If you subset the data into multiple batches with >10000 genes each, you should still have enough information per run to get a stable estimate of the common/trended dispersion (depending on what flavour of edgeR you're using). So, it should be valid to apply edgeR within each subset.

The key here is to avoid subsetting in silly ways that affect the amount of information in each subset. For example, don't put most of the high-abundance genes in one subset and most of the low-abundance genes in another, because then you won't have enough information to stably estimate the mean-dispersion trend in both subsets. I'd recommend randomly sampling genes into the subsets, to avoid any biases in selection.
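For instance, random assignment of genes to chunks can be sketched in base R like this (the object names and sizes are illustrative, not from this thread):

```r
set.seed(42)

# Toy stand-in for the real count matrix: 200 genes x 6 samples.
counts <- matrix(rpois(200 * 6, lambda = 10), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("sample", 1:6)))

# Randomly assign each gene to one of 4 chunks, then split the row indices.
n.chunks <- 4
chunk.id <- sample(rep(seq_len(n.chunks), length.out = nrow(counts)))
chunks <- split(seq_len(nrow(counts)), chunk.id)

# Each chunk is a row-subset of the full matrix; an edgeR analysis
# (DGEList + dispersion estimation + glmFit) would then run on each one.
sub.counts <- lapply(chunks, function(idx) counts[idx, , drop = FALSE])
sapply(sub.counts, nrow)  # roughly equal chunk sizes
```

Because the assignment is random, every chunk spans the full abundance range, so the mean-dispersion trend can be estimated within each chunk.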

The simpler alternative is to just allocate more memory on the cluster (maybe around 20-30 GB?), use all the genes together, and let edgeR run for an hour or so. Or overnight, depending on how it goes.

Edit: Perhaps an obvious point, but check whether filtering brings down the total number of features. If 90% of the features have average counts close to zero, you might as well toss them out because they won't be particularly useful for DE testing anyway.
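A minimal filtering sketch in base R (the threshold of 1 is made up for illustration; edgeR's own abundance-based filtering helpers are the usual route):

```r
set.seed(1)

# Toy count matrix: many features with near-zero counts plus a few real ones.
counts <- rbind(
  matrix(rpois(90 * 4, lambda = 0.1), nrow = 90),  # mostly-zero features
  matrix(rpois(10 * 4, lambda = 50), nrow = 10)    # well-covered features
)

# Keep features whose average count clears a (made-up) threshold.
keep <- rowMeans(counts) >= 1
filtered <- counts[keep, , drop = FALSE]
nrow(filtered)  # far fewer features survive
```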

martiningi ▴ 40

Thank you for this

I did a similar run with 124 samples × 100,000 genes, and it took approximately 8 hours:

> #edgeR independent samples
> y=DGEList(counts=data[,2:125],genes=data[,1])

> time1=Sys.time()
> y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
Disp = 3.86131 , BCV = 1.965
> Sys.time()-time1
Time difference of 30.43362 mins

> time1=Sys.time()
> y <- estimateGLMTrendedDisp(y, design)

> Sys.time()-time1
Time difference of 4.946792 hours

> time1=Sys.time()
> y <- estimateGLMTagwiseDisp(y, design)
> fit <- glmFit(y, design)
> lrt <- glmLRT(fit)
> Sys.time()-time1
Time difference of 2.558655 hours

Right now, the estimateGLMCommonDisp(y, design, verbose=TRUE) call on the larger set has been running for ~12 hours and is still not done (I gave it 16 GB). Does allocating more memory help?

EDIT: I can probably filter out more genes. This is my ASE (allele-specific expression) analysis that you helped me out with so brilliantly last week, so there are a lot of zeros and low counts. Despite aggressive filtering, around ~200,000 features remain.

Aaron Lun ★ 27k

There's an optimize call within estimateGLMCommonDisp to maximize the adjusted profile likelihood across all genes. That's probably the reason it's taking forever, especially if the optimization can't converge on this huge dataset. I'd replace all of the estimateGLM* calls with estimateDisp; that would probably make it a fair bit faster. Also, assigning more memory probably won't help here; if it's anything like the clusters I've worked with, your job should just terminate if it runs out of memory, so the fact that it's still running probably means there's enough memory.
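For reference, the replacement is a single call. A self-contained sketch with simulated data standing in for the real matrix (requires the edgeR package from Bioconductor; all names and sizes here are illustrative):

```r
library(edgeR)

# Simulated stand-in for the real data: 100 genes x 6 samples, two groups.
set.seed(1)
counts <- matrix(rnbinom(100 * 6, mu = 20, size = 5), nrow = 100)
group <- factor(rep(c("A", "B"), each = 3))
design <- model.matrix(~group)

y <- DGEList(counts = counts)
# One estimateDisp call replaces the estimateGLMCommonDisp/TrendedDisp/
# TagwiseDisp trio, computing all three dispersion flavours in one pass.
y <- estimateDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit)
```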

martiningi ▴ 40

The shorter run now took 1.5 hours, so I am optimistic that the big one can finish in a reasonable time. Thank you very much.

martiningi ▴ 40

28 hours later, the longer run (300 × 200,000) is still going.

For fun I took the shorter run (120 × 100,000), split it up into random chunks of 10,000 items each, and got close to a 100% match in p-values from both edgeR and limma compared to running the full dataset. The chunks ran in parallel for about 10 minutes on the cluster, compared to 1.5 hours for the whole set. With this number of items, and provided you split randomly, I think it is reasonable to run this in chunks if needed.

Aaron Lun ★ 27k

Okay, I'm starting to suspect that there's a problem with the GLM convergence. If you're using a model that can't be parametrized as a one-way layout, edgeR will switch to a Levenberg-style damping approach for GLM fitting (see mglmLevenberg). This has a maximum iteration cut-off to keep things quick, but that only applies to the outer loop; there are also several internal loops to guarantee a decrease in the deviance, and these are not so protected. In theory, those inner loops might go on forever if the deviance landscape is really complicated.

To determine if this is the case, try the chunking strategy and see if several chunks are stalling on estimateDisp. If so, it suggests that there are a couple of problematic genes, rather than the entire procedure just being generally slow. I don't think this is easily solvable from your end. Rather, we (i.e., the developers) will have to isolate the problematic genes by stepping through the compiled code. This is probably easiest to do if we can get a copy of the DGEList and design matrix - anonymized, of course. It'll also make our job easier if you can narrow down the suspects.

Finally, make sure you don't have NA values in your count matrix; in my past experience, those seem to make the GLM methods loop infinitely. I don't think there's any protection against this by default.
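A quick pre-flight check along these lines, in base R (toy matrix with a deliberately planted NA; names are illustrative):

```r
set.seed(2)

# Toy count matrix: 5 genes x 4 samples, with one planted NA.
counts <- matrix(rpois(20, lambda = 5), nrow = 5)
counts[2, 3] <- NA

# Scan for NAs before handing the matrix to edgeR.
anyNA(counts)                           # TRUE: not safe to use yet
na.pos <- which(is.na(counts), arr.ind = TRUE)
na.pos                                  # row/column of each offending entry

# One possible remedy (what was done in this thread): set NAs to zero.
counts[is.na(counts)] <- 0
anyNA(counts)                           # FALSE
```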

martiningi ▴ 40

Thank you

For the chunked longer run, all chunks timed out after 12 hours (the maximum for "short runs"). These were chunks of 10,000 genes each. So either multiple items failed, or the whole thing cannot deal with 300+ samples.

There are no NAs: as per your prior suggestions (I'm doing ASE), I changed them to 0 and set up the matrix so that I'm correcting for patient, with the ref and alt alleles as two columns for each individual.

Right now I'm doing serial experiments with smaller chunks, and for fun with just a subset of the patients (though I'm not sure how I would then combine the results for each gene), to see what can run.

I'll send you results once I've narrowed it down

martiningi ▴ 40

And finally

I spent a good chunk of yesterday going over my count matrix and eventually found a sample with all 0 counts (not NA, which I had screened for a million times) that I guessed would be stalling the GLM methods similarly to NAs. I took that sample out, reran, and it worked.

The larger matrix (300 × 200,000) runs with 16 GB of allocated RAM in roughly 15 hours. No chunking needed (although I learned that chunking seems OK for a large dataset). Thank you so much for your help.

Aaron Lun ★ 27k

Good to hear. Perhaps we should add some internal protection against these problematic NA and zero values. This is already present in mglmOneGroup, which just returns fitted values of zero and coefficient estimates of negative infinity, but I don't think it's implemented for mglmLevenberg.

This kind of gene is not usually a problem because it gets removed by abundance-based filtering. In fact, I'm surprised that it was not removed, given that you said you filtered your data fairly aggressively.

martiningi ▴ 40

I filtered based on a minimal number of individuals with a minimum number of counts, so even though one individual had 0 counts on all genes, nothing got filtered out and that individual was not flagged.
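The gap is that a gene-wise filter inspects rows, while this failure mode lives in a column. A complementary per-sample check in base R would have caught it (toy matrix; names are illustrative):

```r
set.seed(7)

# Toy matrix: 50 genes x 5 samples, with sample 4 all zeros.
counts <- matrix(rpois(50 * 5, lambda = 10), nrow = 50)
counts[, 4] <- 0

# Library sizes are column sums; a zero library size gives an
# undefined log-offset, which is what stalls the GLM fit.
lib.size <- colSums(counts)
bad <- which(lib.size == 0)
bad  # sample 4

# Drop the offending sample before running edgeR.
counts <- counts[, lib.size > 0, drop = FALSE]
```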

Aaron Lun ★ 27k

Ah, sorry; I misread your post, and thought the offending feature was a gene rather than a sample. Well, this is more obvious; the library size for the offending sample will be zero, and the offset (i.e., the log-library size) for this sample will be undefined, such that everything after that screws up.

By the way, I checked and mglmLevenberg already protects against all-zero genes (all coefficients are set to NA, because nothing is estimable), so that wouldn't be causing the problem anyway.