Question: Limma linear fit residual standard deviation calculation
0
gravatar for krassowski.michal
29 days ago by
krassowski.michal0 wrote:

This is a low-level question, but I am very curious about the reason for limma calculating the residual standard deviation (sigma) in lmFit (or more precisely, lm.series) as:

# 1. lm.series
sqrt(mean(out$effects[-(1:out$rank)]^2))

rather than:

# 2. what I would expect
x <- out$residuals
sqrt(sum((x - mean(x))^2) / (length(x) - out$rank))
  1. Is there a numerical advantage of using the former, or some special case in which it would give different result?

  2. Would anyone be able to provide a reference on this way of calculating the residual standard deviation?

In my comparisons, the two are very comparable (mean absolute difference on the order of 1e-17), though I do not have a good intuition for the maths of effects.

Initially, I thought that it might be related to performance, but a simple benchmark suggests that the second option is actually faster:

  1. 690 ms ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  2. 455 ms ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Also, the latter gives results closer to the R's stats::sigma function:

stats::sigma(lm(y ~ X))

with the mean absolute difference on the order of 1e-19 for my test dataset (1e-17 for the lm.series implementation).

limma • 98 views
ADD COMMENTlink modified 29 days ago by Gordon Smyth38k • written 29 days ago by krassowski.michal0
Answer: Limma linear fit residual standard deviation calculation
2
gravatar for Gordon Smyth
29 days ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:

As you've already confirmed, limma gives the correct answer.

Your formula for sigma would fail if y contains NA or if X does not span the intercept term. However a slight correction:

sum(out$residuals^2, na.rm=TRUE) / out$df.residual

would work and could equally have been used by limma. The formula that limma uses agrees with how one would compute sigma at the C level but, at the R level, it gets slowed down very slightly by having to subset the effects vector.

If you want to understand what the effects are, you might start with these references:

http://www.statsci.org/smyth/pubs/EoB/ban041-.pdf https://onlinelibrary.wiley.com/doi/full/10.1002/0470011815.b2a14022

Mathematically, the effects are Q^T y where Q is from the QR decomposition of X / sqrt(weights).

BTW, I do not normally answer questions about limma code, only about how limma behaves at the user level. I've made an exception here because you're a student, but time just doesn't allow me to do that on a regular basis.

ADD COMMENTlink modified 22 days ago • written 29 days ago by Gordon Smyth38k

Thank you for taking the time to answer this question, I really appreciate it. I find the second reference very useful. Thank you!

ADD REPLYlink written 28 days ago by krassowski.michal0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 211 users visited in the last hour