Improving performance of edgeR and limma while adjusting for multiple confounders
@e5f23111
Last seen 3 days ago
United States

I am trying to run edgeR and limma on groups with very large sample sizes (group 1 n = 236, group 2 n = 490). glmQLFit() in edgeR and voom() + lmFit() in limma-voom take about 2 minutes when adjusting for two confounding variables. For even larger datasets, where each group has more than 1000 samples, these functions take several minutes.

I am wondering whether these functions can be optimized to reduce the compute time. Is there any way to parallelize the computation?
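For reference, this is the kind of pipeline being timed. It is a minimal sketch assuming a count matrix `counts`, a factor `group`, and two confounder vectors `conf1` and `conf2` (all placeholder names, not from the original post):

```r
library(edgeR)
library(limma)

# Design matrix: group effect adjusted for two confounders
design <- model.matrix(~ group + conf1 + conf2)

# edgeR quasi-likelihood pipeline
y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
fit <- glmQLFit(y, design)    # the step taking ~2 min here
res <- glmQLFTest(fit, coef = 2)

# limma-voom pipeline
v <- voom(y, design)
vfit <- lmFit(v, design)
vfit <- eBayes(vfit)
```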

Thank you.

edgeR limma

If 2 minutes is unacceptable to you, you can fall back to limma-trend, which should give a dramatic speedup and finish in seconds. Inference is often similar.
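A minimal limma-trend sketch, assuming the same `y` (DGEList) and `design` as in the question; limma-trend skips the per-observation voom weights and instead models the mean-variance trend at the eBayes step:

```r
# Convert counts to log2-CPM; prior.count damps variability at low counts
logCPM <- cpm(y, log = TRUE, prior.count = 3)

# Ordinary linear model fit, then empirical Bayes with trend = TRUE
fit <- lmFit(logCPM, design)
fit <- eBayes(fit, trend = TRUE)
topTable(fit, coef = 2)
```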

@gordon-smyth
Last seen 1 day ago
WEHI, Melbourne, Australia

We recommend that you use voomLmFit() instead of voom+lmFit.
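Assuming the same `y` and `design` as in the question, the recommended call is a one-step replacement for the voom() + lmFit() pair:

```r
# voomLmFit() (in the edgeR package) combines voom and lmFit in one call
fit <- voomLmFit(y, design)
fit <- eBayes(fit)
topTable(fit, coef = 2)
```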

We are pretty proud of the speed of glmQLFit and voom, and we hadn't expected that a few minutes for a very large dataset would be too much of a problem for users, considering the much larger computational overheads required to prepare the RNA-seq dataset in the first place.

Anyway, the short answer is that we don't offer parallelization options for either package. Our aim is that both packages should perform well on all platforms, whereas parallelization is very platform specific and, in our limited experience, offers modest gains considering the extra complications in terms of code and code dependencies. As a developer, it is not something that I want to get into for a computation that only takes a few minutes in the first place, even on a laptop.

glmQLFit is already highly optimized by dropping almost everything down to clean C code. voom() could be sped up considerably by dropping everything down to C. That is on our to-do list, but we haven't done it yet, and the function is already faster than competing methods. The slowest part of the voom pipeline is the possible use of duplicateCorrelation() for blocking. That is certainly a candidate for parallelization, but we are thinking in terms of algorithmic improvements rather than going to computing technology in the first instance.
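For context, the duplicateCorrelation() blocking path mentioned above looks like this; it is a sketch assuming a blocking factor `block` (e.g. patient ID, a placeholder name) plus the `y` and `design` from the question:

```r
# First voom pass, then estimate the within-block consensus correlation
# (this duplicateCorrelation step is the slow one)
v <- voom(y, design)
corfit <- duplicateCorrelation(v, design, block = block)

# Re-run voom and lmFit using the estimated consensus correlation
v <- voom(y, design, block = block,
          correlation = corfit$consensus.correlation)
fit <- lmFit(v, design, block = block,
             correlation = corfit$consensus.correlation)
fit <- eBayes(fit)
```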
