Question

Is lmFit() the fastest way to fit thousands of linear model in R and use what algorithm make it so fast ?

0

Entering edit mode

heyao ▴ 30

@heyao-14543

Last seen 4.1 years ago

Hi all,

I just noticed that lmFit() is quite fast for linear model fit even for very big single cell data.

This make me curious that how fast does lmFit() achieve but I haven't found any benchmarking results about that.

If lmFit() is actually the fastest way for simple linear model fit, I guess I would like to always use lmFit() instead of tidyr+purrr+lm() .

If lmFit() is fast because it uses some algorithms where can I find the implemention details ? Assuming I want to write a silimar function using other language like python or julia for my own use.

limma • 3.6k views

ADD COMMENT • link updated 5.3 years ago by James W. MacDonald 65k • written 5.3 years ago by heyao ▴ 30

0

Entering edit mode

I recently had RegParallel accepted to Bioconductor, which can fit independent models via lm(), glm(), bayesglm(), coxph(), et cetera, and parallelise this process (also 'nested' parallel processing is possible). It is not as quick as lmFit, though. I developed it while having to run conditional logistic regression models on old GWAS data for trios.

ADD REPLY • link 5.3 years ago Kevin Blighe ★ 3.9k

score 0 · Answer 1 · 2019-01-04

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 2 hours ago

United States

Yes, lmFit is super fast, mainly because it directly calls (and manipulates data from) lm.fit or lm.wfit, which call C code to do the work. I cannot imagine that some concatenation of tidyr and purr and lm would be even remotely close because that almost certainly involves fitting each model separately whereas lm.fit is fitting all the models simultaneously. But then I avoid the tidyverse like a plague, so maybe I am not informed enough to say whether or not some combination of tidyr and purr and lm would be reasonable.

You can find the implementation details in the source code for lmFit (which calls lm.series or glm.series or mrlm, depending on arguments), all of which end up calling either lm.fit or lm.wfit, with the right arguments depending on the model being fit. And inside of both lm.fit and lm.wfit are calls to the C code, which you can get by downloading the R sources.

ADD COMMENT • link 5.3 years ago James W. MacDonald 65k

0

Entering edit mode

lm.fit is probably as fast as it can possibly be; the heavy lifting is done by the C code with LINPACK routines in a vectorized manner across all genes. Unconditional speed increases would probably require fiddling around with the compilation settings for BLAS or some other low-level stuff that would be beyond the ability of the average user.

limma's handling of weighted linear models could, in theory, be sped up by migrating the heavy lifting to C(++). Currently, lmFit calls lm.wfit in a R-level loop across all genes, and looping in R is generally a good candidate for optimization. However, this is not a priority - limma is plenty fast enough as it is, and frankly, we have better things to do.

ADD REPLY • link 5.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you James and Aaron, the answer and comments explain what I want to know clearly ！

ADD REPLY • link 5.3 years ago heyao ▴ 30