loess and limma
1
0
Entering edit mode
@john-seers-ifr-1605
Last seen 10.2 years ago
An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20060914/ 3a832df5/attachment.pl
• 970 views
ADD COMMENT
0
Entering edit mode
@sean-davis-490
Last seen 3 months ago
United States
On Thursday 14 September 2006 08:02, john seers (IFR) wrote: > Hello Anyone > > I was just experimenting with some made up data in limma and could not > understand the results of the "normalisation within array" using loess > smoothing. Can anybody explain for me why the results are completely > smoothed despite there being a strongly upregulated gene? > > My first reaction was that it was because there were too few points for > loess to handle it sensibly. But would you expect it to completely > smooth the results? > > Here is my artificial data with the columns in 598new and 600new > identical and with the gene in row 12 with large signal values. > > > RG$R > > 598new 599new 600new 617new 621new 637new > [1,] 310 280 310 321 290 291 > [2,] 290 285 290 311 291 292 > [3,] 295 290 295 291 301 293 > [4,] 298 295 298 281 311 294 > [5,] 301 301 301 292 321 295 > [6,] 302 305 302 302 331 296 > [7,] 303 294 303 313 281 297 > [8,] 304 293 304 324 291 298 > [9,] 305 311 305 334 303 321 > [10,] 306 312 306 314 304 322 > [11,] 307 313 307 312 312 323 > [12,] 4800 314 4800 287 324 324 > > > RG$G > > 598new 599new 600new 617new 621new 637new > [1,] 301 320 301 330 280 311 > [2,] 299 319 299 335 290 321 > [3,] 302 280 302 295 300 291 > [4,] 298 281 298 280 310 293 > [5,] 303 282 303 285 320 295 > [6,] 297 284 297 287 330 285 > [7,] 304 310 304 289 335 321 > [8,] 296 311 296 298 290 331 > [9,] 305 319 305 284 320 297 > [10,] 295 322 295 290 321 296 > [11,] 308 317 308 321 324 295 > [12,] 301 325 301 324 325 294 > > > RG$Rb > > 598new 599new 600new 617new 621new 637new > [1,] 30 30 30 30 30 30 > [2,] 30 30 30 30 30 30 > [3,] 30 30 30 30 30 30 > [4,] 30 30 30 30 30 30 > [5,] 30 30 30 30 30 30 > [6,] 30 30 30 30 30 30 > [7,] 30 30 30 30 30 30 > [8,] 30 30 30 30 30 30 > [9,] 30 30 30 30 30 30 > [10,] 30 30 30 30 30 30 > [11,] 30 30 30 30 30 30 > [12,] 30 30 30 30 30 30 > > > RG$Gb > > 598new 599new 600new 617new 621new 637new > [1,] 30 30 30 30 30 30 > [2,] 30 30 30 30 30 30 > [3,] 30 30 30 30 30 30 > [4,] 30 30 30 30 30 30 > [5,] 30 30 30 30 30 30 > [6,] 30 30 30 30 30 30 > [7,] 30 30 30 30 30 30 > [8,] 30 30 30 30 30 30 > [9,] 30 30 30 30 30 30 > [10,] 30 30 30 30 30 30 > [11,] 30 30 30 30 30 30 > [12,] 30 30 30 30 30 30 > > > > After normalising: > > # Normalise within arrays > MA.nwa <- normalizeWithinArrays(RG, method="loess") > > what looks like completely smoothed values. Any effects of the one > upregulated gene are gone. i.e. everything is very close to 1. > > > 2^MA.nwa$M > > 598new 599new 600new 617new 621new 637new > [1,] 1.0369680 0.9770115 1.0369680 1.0000000 1.0000000 1.0000000 > [2,] 1.0000000 1.0000000 1.0000000 0.9498056 1.0000000 0.8084718 > [3,] 0.9675804 1.0000000 0.9675804 1.0000000 1.0000000 1.0000000 > [4,] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0038023 > [5,] 0.9775535 1.0000000 0.9775535 1.0000000 1.0461482 1.0000000 > [6,] 0.9971243 1.0000000 0.9971243 1.0745859 1.0067460 1.0000000 > [7,] 1.0000000 1.0036845 1.0000000 1.0000000 0.8200222 0.8418535 > [8,] 1.0012844 0.9963297 1.0012844 1.0000000 1.0000000 1.0000000 > [9,] 1.0036431 0.9930062 1.0036431 1.0910065 0.9380292 1.0000000 > [10,] 1.0055536 0.9999995 1.0055536 0.9996738 1.0000000 1.0005323 > [11,] 1.0000000 1.0070430 1.0000000 0.8833719 1.0000000 1.0005324 > [12,] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 > > > Just for comparison if I do a different type of normalisation you can > > see the upregulated gene standing out: > > MA.nwa <- normalizeWithinArrays(RG, method="median") > > > > > > 2^MA.nwa$M > > 598new 599new 600new 617new 621new 637new > [1,] 1.0332103 0.8896202 1.0332103 0.9550461 1.0364855 0.9235331 > [2,] 0.9665428 0.9105525 0.9665428 0.9071081 1.0004538 0.8952134 > [3,] 0.9742647 1.0732378 0.9742647 0.9697219 1.0003118 1.0019211 > [4,] 1.0000000 1.0895189 1.0000000 0.9885219 1.0001800 0.9980826 > [5,] 0.9926740 1.1097659 0.9926740 1.0116114 1.0000573 0.9943019 > [6,] 1.0187266 1.1172789 1.0187266 1.0420495 0.9999427 1.0371934 > [7,] 0.9963504 0.9729903 0.9963504 1.0758191 0.8201698 0.9122977 > [8,] 1.0300752 0.9658553 1.0300752 1.0801029 1.0004538 0.8852921 > [9,] 1.0000000 1.0033931 1.0000000 1.1783992 0.9381981 1.0836774 > [10,] 1.0415094 0.9966184 1.0415094 1.0754682 0.9383988 1.0914894 > [11,] 0.9964029 1.0175767 0.9964029 0.9541325 0.9559423 1.0993603 > [12,] 17.6014760 0.9934796 17.6014760 0.8606734 0.9932423 1.1072908 > > > > Is this just because there are too few points? If so how do you > determine how many points are needed to make a loess normalisation > valid? >From the loess help page: Fitting is done locally. That is, for the fit at point x, the fit is made using points in a neighbourhood of x, weighted by their distance from x (with differences in 'parametric' variables being ignored when computing the distance). The size of the neighbourhood is controlled by alpha (set by 'span' or 'enp.target'). For alpha < 1, the neighbourhood includes proportion alpha of the points, and these have tricubic weighting (proportional to (1 - (dist/maxdist)^3)^3. For alpha > 1, all points are used, with the 'maximum distance' assumed to be alpha^1/p times the actual maximum distance for p explanatory variables. Notice that the "span" parameter controls the amount of data that is used in the local fit. Also, note that the span is measured in the "A" axis of the MA plot. The default span=0.3 in limma; looking at the formula for the weighting above one can see that the weights fall off with distance from the point in question. For your data, you have one point that has a quite high intensity in the red channel, leading to a very large A value. Since the loess normalization aims to fit a smoothed line and then bring that line to zero, the line at the differentially-expressed gene goes through that point, since there is little other data included in the fit, and the resulting normalized value is zero (or a ratio of one). I don't find loess to be problematic for "real" data, unless of course, it is applied where there are EXPECTED to be trends of ratio with average intensity, such as with CGH data. You could increase the value of span to see how it affects your results. Sean
ADD COMMENT
0
Entering edit mode
Hi Sean Thanks for the reply. I guess what you are confirming that it is because there are too few points. I guess what surprised me was that there was absolutely no trace of a result. Have you any idea how many points are needed for the data to become "real"? >You could increase the value of span to >see how it affects your results. I did try this but it did not help. --- John Seers Institute of Food Research Norwich Research Park Colney Norwich NR4 7UA tel +44 (0)1603 251497 fax +44 (0)1603 507723 e-mail john.seers at bbsrc.ac.uk e-disclaimer at http://www.ifr.ac.uk/edisclaimer/ Web sites: www.ifr.ac.uk www.foodandhealthnetwork.com
ADD REPLY
0
Entering edit mode
On Thursday 14 September 2006 08:46, you wrote: > Hi Sean > > Thanks for the reply. I guess what you are confirming that it is because > there are too few points. I guess what surprised me was that there was > absolutely no trace of a result. Have you any idea how many points are > needed for the data to become "real"? John, Loess assumes that there is supposed to be no correlation between average intensity and ratio--that is why it works. What you did with your experiment was to create a situation where there was a strong correlation between ratio and average intensity, thus violating that assumption. If that assumption is wrong, as it is for your data, loess will still do the calculations, but it will remove that correlation, which in your case represents signal. So, "real" data doesn't relate to the number of points, it means that the assumption of no correlation between ratio and average intensity holds, which I have generally found to be the case for gene expression data, at least to a working approximation. If that assumption DOES NOT hold, as in your data and array CGH, loess is not the correct method for normalization. Hope that helps. Sean
ADD REPLY
0
Entering edit mode
Sean Yes, that does help. Thank you. This worries me as people routinely slap in a loess normalisation. But from what you are saying this is not likely to be valid unless you have an independent reference. i.e. pooled or from another source. This means if you are using your reference channel as in a "Direct Two-Color Design" (limma user guide Chapter 7) there is likely to be a correlation so you should not use loess? Does that make sense to you? John --- John Seers Institute of Food Research Norwich Research Park Colney Norwich NR4 7UA tel +44 (0)1603 251497 fax +44 (0)1603 507723 e-mail john.seers at bbsrc.ac.uk e-disclaimer at http://www.ifr.ac.uk/edisclaimer/ Web sites: www.ifr.ac.uk www.foodandhealthnetwork.com -----Original Message----- From: Sean Davis [mailto:sdavis2@mail.nih.gov] Sent: 14 September 2006 13:57 To: john seers (IFR); bioconductor at stat.math.ethz.ch Subject: Re: [BioC] loess and limma On Thursday 14 September 2006 08:46, you wrote: > Hi Sean > > Thanks for the reply. I guess what you are confirming that it is because > there are too few points. I guess what surprised me was that there was > absolutely no trace of a result. Have you any idea how many points are > needed for the data to become "real"? John, Loess assumes that there is supposed to be no correlation between average intensity and ratio--that is why it works. What you did with your experiment was to create a situation where there was a strong correlation between ratio and average intensity, thus violating that assumption. If that assumption is wrong, as it is for your data, loess will still do the calculations, but it will remove that correlation, which in your case represents signal. So, "real" data doesn't relate to the number of points, it means that the assumption of no correlation between ratio and average intensity holds, which I have generally found to be the case for gene expression data, at least to a working approximation. If that assumption DOES NOT hold, as in your data and array CGH, loess is not the correct method for normalization. Hope that helps. Sean
ADD REPLY
0
Entering edit mode
On Thursday 14 September 2006 09:48, you wrote: > Sean > > Yes, that does help. Thank you. > > This worries me as people routinely slap in a loess normalisation. But > from what you are saying this is not likely to be valid unless you have > an independent reference. i.e. pooled or from another source. It worries me, also. One of the fantastic aspects of bioconductor is that it brings VERY powerful tools to even the novice user, so folks can go down the wrong path easily if details like assumptions used for normalization are not checked. In the case of normalization of two-color data, this is a relatively simple process, at least in principle. Make scatterplots and MA plots and check the quality of the data. There really isn't a good substitute for doing this basic step. If your data look like they have a correlation (a non-zero slope on an MA plot), that is either real biology or not. If you use loess, you are basically saying that you are assuming that it is not real biology. However, you can see that if you never look at the data, you will never know. > This means if you are using your reference channel as in a "Direct > Two-Color Design" (limma user guide Chapter 7) there is likely to be a > correlation so you should not use loess? I don't think a "Direct two-color design" is an absolute contraindication to using loess, but I may not be understanding your point here. I just got Wolfgang's email and I agree completely. When using ANY normalization method, one needs to understand the parameters, the underlying assumptions, and the data (by looking at plots and quality metrics). Sean
ADD REPLY
0
Entering edit mode
Hi John, > This worries me as people routinely slap in a loess normalisation. Non-parametric smoothing (such as loess) is an enormously powerful tool, but getting it right (especially at the boundaries) can be tricky, and there are parameters to choose such as the 'span' and the order of the local polynomials, so I am often also perplexed how people use such a heavy gun without much thought. There is some potential for over-fitting and smoothing out real signal. Well, that's my little rant for the afternoon :) Best wishes Wolfgang ------------------------------------------------------------------ Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber
ADD REPLY
0
Entering edit mode
Hi Wolfgang > so I am often also perplexed how people use such a >heavy gun without much thought. I am trying to avoid being one of those people. :) John --- John Seers Institute of Food Research Norwich Research Park Colney Norwich NR4 7UA tel +44 (0)1603 251497 fax +44 (0)1603 507723 e-mail john.seers at bbsrc.ac.uk e-disclaimer at http://www.ifr.ac.uk/edisclaimer/ Web sites: www.ifr.ac.uk www.foodandhealthnetwork.com -----Original Message----- From: Wolfgang Huber [mailto:huber@ebi.ac.uk] Sent: 14 September 2006 15:58 To: john seers (IFR) Cc: sdavis2 at mail.nih.gov; bioconductor at stat.math.ethz.ch Subject: Re: [BioC] loess and limma Hi John, > This worries me as people routinely slap in a loess normalisation. Non-parametric smoothing (such as loess) is an enormously powerful tool, but getting it right (especially at the boundaries) can be tricky, and there are parameters to choose such as the 'span' and the order of the local polynomials, so I am often also perplexed how people use such a heavy gun without much thought. There is some potential for over-fitting and smoothing out real signal. Well, that's my little rant for the afternoon :) Best wishes Wolfgang ------------------------------------------------------------------ Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber
ADD REPLY

Login before adding your answer.

Traffic: 735 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6