This is a low-level question, but I am very curious why limma calculates the residual standard deviation (sigma) in lmFit (or, more precisely, in lm.series) as:
# 1. lm.series
sqrt(mean(out$effects[-(1:out$rank)]^2))
rather than:
# 2. what I would expect
x <- out$residuals
sqrt(sum((x - mean(x))^2) / (length(x) - out$rank))
Is there a numerical advantage to using the former, or some special case in which it would give a different result?
Would anyone be able to provide a reference on this way of calculating the residual standard deviation?
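For concreteness, here is a minimal self-contained sketch along the lines of my comparison (the design matrix X, response y, and the resulting out object are simulated here purely for illustration, not my actual test data):

set.seed(1)
n <- 50
X <- cbind(Intercept = 1, group = rep(0:1, each = n / 2))  # design with an intercept column
y <- drop(X %*% c(2, 0.5)) + rnorm(n)                      # simulated response
out <- lm.fit(X, y)                                        # a single fit, analogous to one gene

# 1. lm.series-style sigma
sqrt(mean(out$effects[-(1:out$rank)]^2))

# 2. residual-based sigma
x <- out$residuals
sqrt(sum((x - mean(x))^2) / (length(x) - out$rank))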
In my comparisons, the two are very comparable (mean absolute difference on the order of 1e-17), though I do not have a good intuition for the maths of effects.
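For what it's worth, effects seems to be t(Q) %*% y from the QR decomposition that lm.fit uses internally, in which case the trailing n - rank effects should carry the residual sum of squares. A quick check along these lines (reusing the simulated out from above, and possibly missing some subtlety) would be:

ef <- qr.qty(out$qr, y)                                   # t(Q) %*% y from the stored QR
all.equal(unname(ef), unname(out$effects))                # should match out$effects
all.equal(sum(ef[-(1:out$rank)]^2), sum(out$residuals^2)) # trailing effects give the RSS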
Initially, I thought that it might be related to performance, but a simple benchmark suggests that the second option is actually faster:
- 690 ms ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 100000 loops each)
- 455 ms ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 100000 loops each)
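The timings above came from my own setup; a rough equivalent using the microbenchmark package (an assumption about tooling, applied to the simulated out from the sketch above) might look like:

library(microbenchmark)
microbenchmark(
    lm.series_style = sqrt(mean(out$effects[-(1:out$rank)]^2)),
    residual_based = {
        x <- out$residuals
        sqrt(sum((x - mean(x))^2) / (length(x) - out$rank))
    },
    times = 1000
)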
Also, the latter gives results closer to R's stats::sigma function:
stats::sigma(lm(y ~ X))
with a mean absolute difference on the order of 1e-19 for my test dataset (1e-17 for the lm.series implementation).
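With the simulated sketch from above (again, illustrative only rather than my real data), the corresponding per-fit absolute differences can be inspected directly:

fit <- lm(y ~ X)  # X already has an intercept column, so one term is aliased (NA coefficient); sigma is unaffected
abs(sqrt(mean(out$effects[-(1:out$rank)]^2)) - stats::sigma(fit))
x <- out$residuals
abs(sqrt(sum((x - mean(x))^2) / (length(x) - out$rank)) - stats::sigma(fit))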