This is a low-level question, but I am very curious why limma calculates the residual standard deviation (sigma) in lmFit (or, more precisely, in lm.series) as:
# 1. lm.series
sqrt(mean(out$effects[-(1:out$rank)]^2))
rather than:
# 2. what I would expect
x <- out$residuals
sqrt(sum((x - mean(x))^2) / (length(x) - out$rank))
Is there a numerical advantage to using the former, or some special case in which it would give a different result?
Would anyone be able to provide a reference on this way of calculating the residual standard deviation?
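For concreteness, here is a minimal self-contained sketch along the lines of my comparison (the design matrix X, response y, and the resulting out object are simulated here purely for illustration, not my actual test data):

set.seed(1)
n <- 50
X <- cbind(Intercept = 1, group = rep(0:1, each = n / 2))  # design with an intercept column
y <- drop(X %*% c(2, 0.5)) + rnorm(n)                      # simulated response
out <- lm.fit(X, y)                                        # a single fit, analogous to one gene

# 1. lm.series-style sigma
sqrt(mean(out$effects[-(1:out$rank)]^2))

# 2. residual-based sigma
x <- out$residuals
sqrt(sum((x - mean(x))^2) / (length(x) - out$rank))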
In my comparisons, the two are very comparable (mean absolute difference on the order of 1e-17), though I do not have a good intuition for the maths of effects.
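For what it's worth, effects seems to be t(Q) %*% y from the QR decomposition that lm.fit uses internally, in which case the trailing n - rank effects should carry the residual sum of squares. A quick check along these lines (reusing the simulated out from above, and possibly missing some subtlety) would be:

ef <- qr.qty(out$qr, y)                                   # t(Q) %*% y from the stored QR
all.equal(unname(ef), unname(out$effects))                # should match out$effects
all.equal(sum(ef[-(1:out$rank)]^2), sum(out$residuals^2)) # trailing effects give the RSS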
Initially, I thought that it might be related to performance, but a simple benchmark suggests that the second option is actually faster:
- 690 ms ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 100000 loops each)
- 455 ms ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 100000 loops each)
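The timings above came from my own setup; a rough equivalent using the microbenchmark package (an assumption about tooling, applied to the simulated out from the sketch above) might look like:

library(microbenchmark)
microbenchmark(
    lm.series_style = sqrt(mean(out$effects[-(1:out$rank)]^2)),
    residual_based = {
        x <- out$residuals
        sqrt(sum((x - mean(x))^2) / (length(x) - out$rank))
    },
    times = 1000
)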
Also, the latter gives results closer to R's stats::sigma function:
stats::sigma(lm(y ~ X))
with a mean absolute difference on the order of 1e-19 for my test dataset (1e-17 for the lm.series implementation).
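With the simulated sketch from above (again, illustrative only rather than my real data), the corresponding per-fit absolute differences can be inspected directly:

fit <- lm(y ~ X)  # X already has an intercept column, so one term is aliased (NA coefficient); sigma is unaffected
abs(sqrt(mean(out$effects[-(1:out$rank)]^2)) - stats::sigma(fit))
x <- out$residuals
abs(sqrt(sum((x - mean(x))^2) / (length(x) - out$rank)) - stats::sigma(fit))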