Question: lowess vs. loess
1
16.2 years ago by
nataraja@mit.edu30 wrote:

Hello!  I have noticed a distinction being made between lowess and loess for the normalization of microarray data, but I'm not quite clear about what the difference is between the two techniques. From William Cleveland's website, it seems that the major difference is that lowess uses only one predictor variable, whereas loess can be used with more than one predictor:

For intensity-based normalization (one predictor) wouldn't the two algorithms boil down to the same thing? Any insight would be greatly appreciated!!

Thank you,
Sripriya Natarajan
Graduate Student, Center for Vascular Biology
Dept. of Pathology, Brigham and Women's Hospital

microarray normalization • 4.8k views
modified 3.5 years ago by Gordon Smyth39k • written 16.2 years ago by nataraja@mit.edu30
3
16.2 years ago by
Gordon Smyth39k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth39k wrote:

Local regression is a statistical method for robustly fitting smoothing curves without prior assumptions about the shape or form of the curve. 'lowess' and 'loess' are two different functions in R which implement local regression. As you've noted, 'loess' is the more recent program and it is essentially equivalent to 'lowess' except that it has more features.

The R function 'loess' has three things that 'lowess' doesn't:

1. It accepts a formula specifying the model rather than the x and y matrices
2. As you've noted, it can be used with more than one predictor.
3. It accepts prior weights.
4. It will estimate the "equivalent number of parameters" implied by the fitted curve.

On the other hand, 'loess' is much slower than 'lowess' and occasionally fails when 'lowess' succeeds, so both programs are kept in R.

When there is only one predictor variable and no prior weights, 'lowess' and 'loess' are in principle exactly equivalent. However the default settings of the two programs are very different. Here is an example in which I force 'lowess' and 'loess' to do precisely the same numerical calculation:

 > y <- rnorm(1000)
> x <- 1:1000
> out.lowess <- lowess(x,y,f=0.3,iter=3,delta=0)
> out.lowess\$y[1:5]
[1] 0.1816632 0.1799619 0.1782683 0.1765826 0.1749048
> out.loess <- loess(y~x,span=0.3,degree=1,family="symmetric",iterations=4,surface="direct")
> fitted(out.loess)[1:5]
[1] 0.1816632 0.1799619 0.1782683 0.1765826 0.1749048

Things to note here:

1. 'f' is the 'span' argument for 'lowess'
2. 'loess' does quadratic (degree=2) local regression by default instead of linear (degree=1)
3. Unless you specify family="symmetric", loess will fit the curve by least squares, i.e., won't do any robustness iterations at all.
4. lowess and loess count iterations in differently: 'iter' in lowess means the number of robustness iterations; 'iterations' in loess means the total number of iterations including the least squares fit, i.e., iterations=iter+1

The only aspect in which it is not possible to make 'loess' and 'lowess' agree exactly is in their treatment of large data sets. When x and y are very long, say 10s of thousands of observations, it is impractical and unnecessary to do the local regression calculation exactly, rather it is usual to interpolate between observations that are very close together. This interpolation is controlled by the 'delta' argument to 'lowess' and by the 'cell' and 'surface' arguments to 'loess'.

When there are a large number of observations, 'lowess' groups together those x-values that are closer than a certain distance apart. Although grouping observations based on distance is in-principle the best approach, this is impractical for 'loess' because 'loess' is designed to accept multiple x-variables. So 'loess' instead groups observations together based on the number of observations on a cell rather than on distances. Because of this small difference, 'lowess' and 'loess' will almost always give slightly different numerical results for large data sets. The difference is in favour of lowess but is generally small enough to be unimportant.

Gordon

1
16.2 years ago by
Gordon Smyth39k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth39k wrote:

To follow up on my general remarks on lowess and loess, I should also explain the slight differences between loess normalization in marrayNorm and limma.

I am the one mostly to blame for the fact that marrayNorm and limma are not exactly the same for loess normalization. Jean and I co-ordinated marrayNorm and limma earlier in the year to use:

span=0.4
degree=1
iterations=5
surface="direct"

(See my other answer to this question for the meaning of these parameters.) These parameter setting are conservative choices. They result in a relatively stiff curve, with a high degree of robustness and with exact loess calculations involving no interpolation. The decision to avoid interpolation was motivated more by the desire to avoid confusing warning messages from 'loess' rather than because interpolation is not accurate.

As some users have noted on this mailing list, the avoidance of interpolation results in very, very slow fits for some data sets. It was much, much too slow for me anyway. So I re-introduced interpolation to limma and have implemented some warning suppression at a lower code level to avoid the confusing warning messages. limma currently uses default values:

span=0.3
degree=1
iterations=4
interpolation: 'lowess'-style interpolation where possible, otherwise 'loess'-style

The default values in limma agree exactly with the earlier software SMA, i.e., with the software that was used for the original papers on loess normalization. If you want limma to produce the slightly stiffer, slightly more robust curves produced by marrayNorm, you can use

normalizeWithinArrays(RG, span=0.4, iterations=5)

The only difference between limma and marrayNorm will then be a result of interpolation used by limma.

Which parameter settings are best? Data analysis is not such a precise science that it is possible to give categorical answers. Either span=0.3 or span=0.4 are acceptable. In general, a higher value for span is appropriate if your data doesn't show much intensity-dependence in dye-bias and vice-versa. Iterations=4 produces a reasonably robust fit. If you desperately need a very robust procedure, perhaps because you have a very high proportion of differentially expressed genes, then the most robust possible print-tip intensity-based normalization procedure is available from

normalizeWithinArrays(RG, method="robustspline", robust="MM")

Gordon