P values on Log or Non-Log Values
Hi Everyone,

I am currently using mt.teststat to calculate p-values between various samples. I was wondering if anyone knew whether it is appropriate to compute p-values on logged or non-logged values. In the past, using MAS processing, I always calculated p-values on the raw values; however, I have recently switched to processing CEL files through rma, and the expression values it produces are on the log base 2 scale.

My lab has noticed that the effect of the log transformation is barely visible for high p-values (above 0.1), but it spreads them all over the place in the low (significant!) range. Running a t-test on logged values greatly enhances the significance (up to 100-fold compared with running it on the straight values) when the significance derives from tight distributions, but has little or no effect when the significance derives from more distant means.

Does anyone have any ideas on which method is correct?

thanks,
Richard Park
Computational Data Analyzer
Joslin Diabetes Center
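For concreteness, here is a minimal sketch of the two workflows being compared. The object names (eset, cl) are hypothetical, it assumes the CEL files are in the working directory, and it uses the equal-variance two-sample statistic from multtest:

    library(affy)      # rma()
    library(multtest)  # mt.teststat()

    ## rma() returns expression values on the log2 scale
    eset  <- rma(ReadAffy())
    x.log <- exprs(eset)        # log2-scale matrix, probe sets x arrays
    x.raw <- 2^x.log            # back-transformed "straight" values

    cl <- c(0, 0, 0, 1, 1, 1)   # hypothetical two-group class labels

    ## Equal-variance two-sample t-statistics on each scale
    t.log <- mt.teststat(x.log, classlabel = cl, test = "t.equalvar")
    t.raw <- mt.teststat(x.raw, classlabel = cl, test = "t.equalvar")

    ## Two-sided p-values from the t distribution (df = n - 2)
    p.log <- 2 * pt(-abs(t.log), df = length(cl) - 2)
    p.raw <- 2 * pt(-abs(t.raw), df = length(cl) - 2)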
From a theoretical standpoint it is more correct to do t-tests on logged data, because one of the assumptions of the t-test is that the underlying data are normally distributed. Microarray expression values are almost always strongly right-skewed, and logging makes the distribution much more symmetrical.

It is doubtful that the logged data are truly normally distributed, but the t-test is fairly robust to violations of the normality assumption as long as the data are reasonably symmetrical. You can also permute your data to estimate the null distribution if you want to remove the reliance on normality. However, in my opinion it is still better to use the symmetrical (logged) data when permuting.

HTH,

Jim

James W. MacDonald
UMCCC Microarray Core Facility
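To illustrate the permutation approach for a single gene, here is a small sketch with simulated data (the group sizes and values are made up):

    ## Log2 expression values for one probe set in two groups (simulated)
    set.seed(1)
    x <- rnorm(5, mean = 8)
    y <- rnorm(5, mean = 9)
    obs <- t.test(x, y)$statistic

    ## Build the null distribution by shuffling the group labels
    vals <- c(x, y)
    B <- 10000
    perm <- replicate(B, {
      idx <- sample(length(vals))
      t.test(vals[idx[1:5]], vals[idx[6:10]])$statistic
    })

    ## Permutation p-value: fraction of permuted |t| at least as extreme
    mean(abs(perm) >= abs(obs))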
At 01:36 AM 6/05/2003, James MacDonald wrote:

> From a theoretical standpoint it is more correct to do t-tests on logged
> data because one of the assumptions of the t-test is that the underlying
> data are normally distributed. Microarray expression values are almost
> always strongly right-skewed, and logging causes the distribution to
> become much more symmetrical.
>
> It is doubtful that the logged data are normally distributed, but the
> t-test is fairly robust to violations of the normality assumption as long
> as the data are relatively symmetrical.

Don't forget that results on the robustness of the t-test to non-normality assume that (i) there are a reasonable number of observations, say at least 15, and (ii) the p-values that need to be accurate are those around 0.05 rather than around 1e-5. Neither of these assumptions holds in the microarray context! But the main point here is, as Jim says, it has to be a whole lot better on the log scale, because the log-intensities are more symmetrically distributed.

Cheers
Gordon
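Gordon's point about the extreme tail is easy to see in a quick simulation (illustrative only, with made-up parameters): with 5 vs 5 samples drawn from the same distribution, every p-value is a null p-value, so the proportion falling below a small cutoff should match that cutoff:

    set.seed(1)

    ## Null p-values on a strongly right-skewed (lognormal) raw scale
    p.raw <- replicate(10000,
      t.test(rlnorm(5, meanlog = 3), rlnorm(5, meanlog = 3))$p.value)

    ## Null p-values on the symmetrical log scale of the same data
    p.log <- replicate(10000,
      t.test(rnorm(5, mean = 3), rnorm(5, mean = 3))$p.value)

    ## Both should be near 0.001 if the tail is accurate; the skewed
    ## raw scale typically deviates further from the nominal level
    mean(p.raw < 0.001)
    mean(p.log < 0.001)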
Hi,

> But the main point here is, as Jim says, it has to be a whole lot better
> on the log-scale because the log-intensities are more symmetrically
> distributed.

Blythe Durbin has done some studies on the effect of transformations on the distribution of microarray data [1], comparing the raw scale, the log scale, and a "generalized log", i.e. a function of the form

    f(x) = log(x + sqrt(x^2 + c^2)) - log(2)

that behaves like the log for x >> c and like a linear function for x ~ 0. While the log is good for high intensities, for small x it can produce strongly fluctuating values and even create skewness, so the generalized log is in many cases a good interpolation between the two. Another nice property of the latter is that, for a suitable choice of c, it can stabilize the variance, i.e. make the standard deviation of the data approximately independent of their mean.

[1] http://handel.cipic.ucdavis.edu/~dmrocke/biolikelihood.pdf Chapter 3.

Best regards
Wolfgang
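The generalized log is a one-liner in R; the transcription below and the example values are only illustrative:

    ## Generalized log: ~log(x) for x >> c, nearly linear (and finite) near x = 0
    glog <- function(x, c) log(x + sqrt(x^2 + c^2)) - log(2)

    x <- c(0, 1, 10, 100, 10000)
    glog(x, c = 50)   # well-behaved near zero
    log(x)            # -Inf at zero, highly variable for small x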