Question

Limma: Normalization with large numbers of differentially expressed genes

Serge Eifes ▴ 90

@serge-eifes-2032

Dear all,

We have performed a time-series experiment (2h, 6h, 10h, 48h, 72h) on dual-channel arrays in which we want to compare gene expression between treated and time-matched untreated cells.

This experiment was done using Agilent 4112F human whole genome microarrays (with 45k features). Statistical analysis is performed using LIMMA 2.10.7 on R 2.5.1. Background correction was performed using normexp with an offset of 100. Loess normalization was done using a span of 0.4 and 12 iterations.

I have encountered the following problems during data analysis:

1) The microarrays for the whole experiment were scanned at quite low intensities. This means that, on average, about 22k features per array have an A-value between 7 and 8.

2) There also seem to be quite large numbers of differentially expressed probes, both when considering the raw per-probe p-values from the moderated t-test for the different time-points and when considering the p-values for the moderated F-statistic after MHC (FDR, BH).

Numbers of significant probes with raw per-probe p-value < 0.05 from the moderated t-test, as retrieved from the "MArrayLM" object:

* t=0h: 1419
* t=2h: 9428
* t=6h: 15013
* t=10h: 13641
* t=48h: 21713
* t=72h: 18027

Numbers of significant probes using the moderated F-statistic (nestedF) with p<0.05 after MHC:

* t=0h: 515
* t=2h: 6278
* t=6h: 11460
* t=10h: 10560
* t=48h: 17250
* t=72h: 14311

My questions:

* Is the accumulation of signals at such low average intensities problematic for the normalization process (besides possibly introducing higher variability into the measurements)?
* I already read in a reply by G.K. Smyth ([BioC] limma Normalization question) that loess normalization might become problematic when around 20% of genes are differentially expressed. So in this case, does loess normalization still work correctly, considering such large numbers of differentially expressed genes? If not, what kind of normalization may be more appropriate for this kind of data?

Thanks in advance!

Best Regards,
Serge Eifes

Laboratoire de Biologie Moleculaire et Cellulaire du Cancer (LBMCC)
Hopital Kirchberg
9, rue Edward Steichen
L-2540 LUXEMBOURG
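To put the counts above next to the 20% figure mentioned in the question, here is a small arithmetic sketch. It is plain Python, not limma code. The total of 45,000 features is an assumption (the post only says "45k"), and the names `N_FEATURES`, `raw_t` and `nested_f` are made up for this illustration. The fractions are of probes called significant, which is not the same thing as the true fraction of changing genes.

```python
# Counts copied from the post. N_FEATURES is an assumption: the post only
# says "45k features", so every percentage below is approximate.
N_FEATURES = 45_000
ALPHA = 0.05

raw_t = {"0h": 1419, "2h": 9428, "6h": 15013,
         "10h": 13641, "48h": 21713, "72h": 18027}
nested_f = {"0h": 515, "2h": 6278, "6h": 11460,
            "10h": 10560, "48h": 17250, "72h": 14311}


def fraction(counts, n=N_FEATURES):
    """Return each count as a fraction of n features."""
    return {t: c / n for t, c in counts.items()}


# Number of raw p < 0.05 hits expected by chance if no probe changed at all.
expected_by_chance = ALPHA * N_FEATURES

raw_frac = fraction(raw_t)
f_frac = fraction(nested_f)

print(f"expected by chance at raw p < {ALPHA}: {expected_by_chance:.0f}")
for t in raw_t:
    print(f"t={t:>3}: raw {raw_frac[t]:6.1%}   nestedF {f_frac[t]:6.1%}")
```

Under that assumption, roughly 48% of features pass raw p < 0.05 at 48h and about 38% remain after correction, so most time-points sit well above the 20% figure, while the raw count at t=0h (1419) is below the roughly 2250 hits expected by chance alone.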

Normalization Cancer limma PROcess • 1.4k views

updated 16.6 years ago by J.delasHeras@ed.ac.uk ★ 1.9k • written 16.6 years ago by Serge Eifes ▴ 90

score 0 · Answer 1 · 2007-10-10

Hi Serge,

Having a lot of spots with low intensity would only add noise; it would not create much of a problem for normalisation. You used the normexp method for background correction, which, when used with an appropriate offset, can be very good at making the M values of low-intensity spots converge nicely towards zero, so I wouldn't worry excessively about that.

Having a large percentage of differentially expressed genes is more of a problem. The figure of 20% sounds like a conservative estimate, but it really depends on how those 20% of spots are distributed, and you may get away with more. Loess simply fits a curve to the population, and the assumption is made that this curve represents the non-changing baseline, where spots with no differential expression should align. This of course assumes that most of the data are distributed more or less evenly on both sides of the curve. These assumptions are generally okay, and even some deviations are tolerated. But you have to look at each experiment and decide.

What do the MA plots look like? Looking at MA plots you can see the distribution of M values before normalisation (so make an MA object using normalisation between arrays, method="none"). You can compare those plots with MA plots after normalisation, to see the effect the normalisation procedure has on the whole distribution. You might find that, when there are too many differentially expressed genes, loess distorts the distribution in ways that do not seem reasonable. How many is too many? It depends on the number, but also on their distribution across intensities. MA plots are the best way to check this sort of thing.
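As a reminder of what an MA plot shows, and why an offset pulls the M values of dim spots towards zero, here is a minimal plain-Python sketch. It is not limma code and it does not implement normexp; it only adds a constant offset to both channels before taking logs. The function name `ma_values` and the spot intensities are hypothetical.

```python
import math


def ma_values(red, green, offset=0.0):
    """Return (M, A) for one spot from background-corrected intensities.

    M = log2(R/G) is the log-ratio and A = (log2 R + log2 G) / 2 is the
    average log-intensity. `offset` is added to both channels first.
    """
    r = red + offset
    g = green + offset
    m = math.log2(r) - math.log2(g)
    a = 0.5 * (math.log2(r) + math.log2(g))
    return m, a


# Two made-up spots with the same 3-fold ratio at very different intensity.
dim = (30.0, 10.0)
bright = (3000.0, 1000.0)

m_dim_raw, a_dim_raw = ma_values(*dim)
m_dim_off, a_dim_off = ma_values(*dim, offset=100.0)
m_bright_raw, a_bright_raw = ma_values(*bright)
m_bright_off, a_bright_off = ma_values(*bright, offset=100.0)

print(f"dim spot:    M {m_dim_raw:.2f} -> {m_dim_off:.2f} with offset 100")
print(f"bright spot: M {m_bright_raw:.2f} -> {m_bright_off:.2f} with offset 100")
```

The dim spot's log-ratio shrinks from about 1.58 to about 0.24, while the bright spot's barely moves (about 1.58 to about 1.49). That is the sense in which low-intensity M values converge towards zero.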
I had an experiment that resulted in a large number of genes being activated (going from low or no expression to a decent level). The MA plot looked something like this (combining several slides, after lmFit): http://mcnach.com/MISC/MAplots2.png

When using loess normalisation, my activated spots contributed excessively to the total population, especially in the range A=11 to A=12.5 or so. The resulting loess curve was clearly pushed up in that area, and the resulting normalised data were distorted, being pushed down.

For this sort of case, the best option is to have a set of known invariant spots, or control spots whose behaviour is expected, and use those to normalise the whole thing. But often we don't have those. In the case above, I was able to identify reasonably easily a large number of the genes that were being activated, and I could flag them so that they would not be included in the normalisation. By removing a reasonable proportion of them I was able to eliminate the distortion, and the final plots look reasonable to me. I took a lot of time to verify genes and make sure that everything was behaving alright, so I was happy with this method. However, it requires that you are familiar with the biology of the experiment, and that you check and recheck that what you're doing doesn't cause harm.

On the positive side, when I compared the results from using loess directly on all spots (despite the distortion) with those from my more carefully chosen spots, I found that whilst the latter was better in general, I could still pick out pretty much the same genes either way. Perhaps I was looking for a population that was already distinct enough...

I'm not sure this is of any help to you right now. I guess the bottom line is: make plots, before and after normalisation, have a good idea of what you are expecting, and see how far it is from what you get. Loess is just fitting a curve to the distribution, according to certain parameters. If you think you know what the curve should look like (representing the non-changing bulk of the data), you can often find a work-around, as long as you know, to some degree, what is expected in your experiment. Without proper control spots, one has to be careful and understand the experiment.

Jose

--
Dr. Jose I. de las Heras
Email: J.delasHeras at ed.ac.uk
Phone: +44 (0)131 6513374
Fax: +44 (0)131 6507360
The Wellcome Trust Centre for Cell Biology
Institute for Cell & Molecular Biology
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
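The distortion and the flagging work-around described in the answer above can be illustrated with a toy simulation. This is a plain-Python sketch, not limma code. A per-bin mean of M is swapped in as a crude stand-in for the loess curve, and all data, spot counts and names (`binned_curve`, `normalise`, `activated`) are invented for the illustration.

```python
import random


def binned_curve(a_vals, m_vals, use, width=0.5):
    """Mean M per intensity bin, using only spots where use[i] is True.

    A crude stand-in for a loess curve: one value per bin of A.
    """
    sums, counts = {}, {}
    for a, m, u in zip(a_vals, m_vals, use):
        if u:
            b = int(a // width)
            sums[b] = sums.get(b, 0.0) + m
            counts[b] = counts.get(b, 0) + 1
    return {b: sums[b] / counts[b] for b in sums}


def normalise(a_vals, m_vals, curve, width=0.5):
    """Subtract the bin's curve value from every spot, flagged or not."""
    return [m - curve.get(int(a // width), 0.0)
            for a, m in zip(a_vals, m_vals)]


rng = random.Random(0)  # fixed seed so the toy data are reproducible
a_vals, m_vals, activated = [], [], []
for _ in range(2000):   # unchanged spots: M centred on 0 at all intensities
    a_vals.append(rng.uniform(6.0, 14.0))
    m_vals.append(rng.gauss(0.0, 0.2))
    activated.append(False)
for _ in range(300):    # activated spots: M centred on +2, A in 11..12.5
    a_vals.append(rng.uniform(11.0, 12.5))
    m_vals.append(rng.gauss(2.0, 0.3))
    activated.append(True)

use_all = [True] * len(a_vals)
use_unflagged = [not f for f in activated]

norm_all = normalise(a_vals, m_vals,
                     binned_curve(a_vals, m_vals, use_all))
norm_unflagged = normalise(a_vals, m_vals,
                           binned_curve(a_vals, m_vals, use_unflagged))


def mean_unchanged(norm, lo=11.0, hi=12.5):
    """Mean normalised M of the unchanged spots with lo <= A < hi."""
    vals = [m for a, m, f in zip(a_vals, norm, activated)
            if not f and lo <= a < hi]
    return sum(vals) / len(vals)


bias_all = mean_unchanged(norm_all)
bias_unflagged = mean_unchanged(norm_unflagged)
print(f"curve from all spots:       unchanged spots shifted by {bias_all:+.2f}")
print(f"curve from unflagged spots: unchanged spots shifted by {bias_unflagged:+.2f}")
```

In this toy setup the curve fitted to all spots is pulled up in the A=11 to 12.5 range, so the unchanged spots there end up clearly below zero after normalisation. Fitting the curve to the unflagged spots only leaves them centred on zero, while the flagged spots are still normalised along with everything else. Real data will rarely be this clean, and knowing which spots to flag is the hard part.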