qvalues, sam, limma

0

Entering edit mode

Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 3.0 years ago

United States

I have an Affy experiment with a very high level of differential expression. It is a one-way ANOVA with 6 treatments, 2 replicates per treatment. We ran both SAM (excel version) and limma, and had very good agreement between them in terms of ranking the genes by the test statistic. For any set of the top K genes, over 90% of the genes were identified by both routines. SAM automatically produces a q-values and estimates FDR and pi_0 (the percentage of non-differentially expressing genes). I used the Bioconductor package "qvalue" to convert the limma p-values to q-values. Both routines are supposed to be based on the same paper. But the SAM q-value for the most highly differentially expressed gene is .0039, whereas the q-value from "qvalue" is 3.9e-12. The SAM q-value for the 1000th most highly differentially expressed gene is also .0039, but the value from "qvalue" is 5.6e-10. As well, "qvalue" (at FDR=0.01) is returning genes whose p-values are pretty big - e.g. p=0.12. Partly this is because the estimated pi_0 is just 7%. By contrast, SAM estimates pi_0 to be 17% and returns a much smaller list of genes at the same FDR. These genes have unadjusted p-values which are quite small. I guess if I believe SAM, I should be getting about 83% of my genes declared statistically significant - which, interestingly enough is about what I do get at FDR=.01 from "qvalue". As always, I welcome the insights of the members of this list. --Naomi Naomi S. Altman 814-865-3791 (voice) Associate Professor Bioinformatics Consulting Center Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111

affy limma convert affy limma convert • 1.8k views

ADD COMMENT • link 19.9 years ago Naomi Altman ★ 6.0k

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

At 07:16 AM 10/06/2004, Naomi Altman wrote: >I have an Affy experiment with a very high level of differential >expression. It is a one-way ANOVA with 6 treatments, 2 replicates per >treatment. > >We ran both SAM (excel version) and limma, and had very good agreement >between them in terms of ranking the genes by the test statistic. For any >set of the top K genes, over 90% of the genes were identified by both routines. As you've found here, the p-values from limma are intended to rank genes, and a general guide of significance, rather than to be believed as absolute p-values. (I've taken every opportunity to point this out on this list, in talks, and in published papers. See for example Section 7 of http://www.statsci.org/smyth/pubs/mareview.pdf.) Generally speaking, the p-values returned by limma tend to be too small. This is because the conjugate normal model being fitted is not exactly correct, most notably because the log-intensity measures are almost invariably more heavy tailed than normal. To convert ranked p-values into meaningful measures of false discovery rates is far from trivial. SAM uses a re-sampling approach, which of course depends on a particular set of assumptions and approximations. The approach which I take in my own analyses is to estimate the false discovery rate using known-null contrasts and specially designed control spots. This approach has not yet been incorporated publicly available software, but I hope it will in the future. I don't fully believe any estimator of false discovery rate that I have yet seen, as they don't consistently return believable results on real data sets. Best Gordon >SAM automatically produces a q-values and estimates FDR and pi_0 (the >percentage of non-differentially expressing genes). I used the >Bioconductor package "qvalue" to convert the limma p-values to >q-values. Both routines are supposed to be based on the same paper. But >the SAM q-value for the most highly differentially expressed gene is >.0039, whereas the q-value from "qvalue" is 3.9e-12. The SAM q-value for >the 1000th most highly differentially expressed gene is also .0039, but >the value from "qvalue" is 5.6e-10. > >As well, "qvalue" (at FDR=0.01) is returning genes whose p-values are >pretty big - e.g. p=0.12. Partly this is because the estimated pi_0 is >just 7%. By contrast, SAM estimates pi_0 to be 17% and returns a much >smaller list of genes at the same FDR. These genes have unadjusted >p-values which are quite small. > >I guess if I believe SAM, I should be getting about 83% of my genes >declared statistically significant - which, interestingly enough is about >what I do get at FDR=.01 from "qvalue". > >As always, I welcome the insights of the members of this list. > >--Naomi

ADD COMMENT • link 19.9 years ago Gordon Smyth 50k

0

Entering edit mode

Marcus Davy ▴ 680

@marcus-davy-374

Last seen 9.6 years ago

Hi, I can make an observation about the differences in pi_0 estimation between the packages siggenes (SAM analysis) and qvalue. > packageDescription("qvalue", field="Version") [1] "1.0.3" > packageDescription("siggenes", field="Version") [1] "1.0.6" > The package siggenes uses the function "p0.est" which has the following cubic spline code: ... spline.out <- smooth.spline(vec.lambda, vec.p0, w = 1 - vec.lambda, df = 3) p0 <- min(predict(spline.out, 1)$y, 1). ... The package qvalue uses the function "qvalue" which has the following cubic spline code: ... spi0 <- smooth.spline(lambda, pi0, df = 3) pi0 <- predict(spi0, x = max(lambda))$y ... The estimate pi_0(lambda) over a range of lambda observations becomes unbiased as lambda -> 1, with a bias variance tradeoff. The real difference is that observations are not weighted by 1-lambda in the package qvalue, and the estimate of pi_0 is at pi_0(lambda=0.95) by default in the arguements (lambda = seq(0, 0.95, 0.05)). In the package siggenes, pi_0 is estimated at pi_0(lambda=1). By weighting the observations with 1-lambda assumes values with small lambda are trusted to be more accurate. Details of the weighting of 1-lambda is in the preprint on page 13 of: Storey J. D. and Tibshirani R. J. Statistical significance for genome-wide experiments. Preprint, January 2003. I have run simulations that suggest that including the 1-lambda weights can improve the variance in the estimation of pi_0(lambda) over the entire range of pi_0. marcus >>> Naomi Altman <naomi@stat.psu.edu> 10/06/2004 9:16:00 AM >>> I have an Affy experiment with a very high level of differential expression. It is a one-way ANOVA with 6 treatments, 2 replicates per treatment. We ran both SAM (excel version) and limma, and had very good agreement between them in terms of ranking the genes by the test statistic. For any set of the top K genes, over 90% of the genes were identified by both routines. SAM automatically produces a q-values and estimates FDR and pi_0 (the percentage of non-differentially expressing genes). I used the Bioconductor package "qvalue" to convert the limma p-values to q-values. Both routines are supposed to be based on the same paper. But the SAM q-value for the most highly differentially expressed gene is .0039, whereas the q-value from "qvalue" is 3.9e-12. The SAM q-value for the 1000th most highly differentially expressed gene is also .0039, but the value from "qvalue" is 5.6e-10. As well, "qvalue" (at FDR=0.01) is returning genes whose p-values are pretty big - e.g. p=0.12. Partly this is because the estimated pi_0 is just 7%. By contrast, SAM estimates pi_0 to be 17% and returns a much smaller list of genes at the same FDR. These genes have unadjusted p-values which are quite small. I guess if I believe SAM, I should be getting about 83% of my genes declared statistically significant - which, interestingly enough is about what I do get at FDR=.01 from "qvalue". As always, I welcome the insights of the members of this list. --Naomi Naomi S. Altman 814-865-3791 (voice) Associate Professor Bioinformatics Consulting Center Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111 _______________________________________________ Bioconductor mailing list Bioconductor@stat.math.ethz.ch https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor ______________________________________________________ The contents of this e-mail are privileged and/or confidenti...{{dropped}}

ADD COMMENT • link 19.9 years ago Marcus Davy ▴ 680

0

Entering edit mode

Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 3.0 years ago

United States

Of course, qvalue uses the p-values generated by another routine (in my case, by limma) and SAM generates the q-values from the test statistic. However, just for fun, I made the following changes to qvalue: > spi0 <- smooth.spline(lambda, pi0, df = 3,w=1-lambda) > pi0 <- predict(spi0, x = 1)$y ... SAM (excel version) estimates pi0=0.17. qvalue (vanilla) estimates pi0=0.07. qvalue (revised) estimates pi0=0.066 and the q-values are almost identical. So, I would say that the differences between SAM and limma have to do with the estimated p-values. While Gordon suggested that the distribution may not be the proposed F-distribution, I suspect that the problem is the level of differential expression, which is probably over 70% (although possibly not as high as 80-90% as suggested by the analysis). It is difficult to establish the null distribution from the data when pi0 is very small. --Naomi At 12:01 PM 6/10/2004 +1200, Marcus Davy wrote: >Hi, >I can make an observation about the differences in pi_0 estimation >between >the packages siggenes (SAM analysis) and qvalue. > > > packageDescription("qvalue", field="Version") >[1] "1.0.3" > > packageDescription("siggenes", field="Version") >[1] "1.0.6" > > > >The package siggenes uses the function "p0.est" which has the following > >cubic spline code: >... > spline.out <- smooth.spline(vec.lambda, vec.p0, w = 1 - vec.lambda, > > df = 3) > p0 <- min(predict(spline.out, 1)$y, 1). >... >The package qvalue uses the function "qvalue" which has the following >cubic spline code: >... > spi0 <- smooth.spline(lambda, pi0, df = 3) > pi0 <- predict(spi0, x = max(lambda))$y >... > >The estimate pi_0(lambda) over a range of lambda observations becomes >unbiased as lambda -> 1, with a bias variance tradeoff. > >The real difference is that observations are not weighted by 1-lambda >in the package qvalue, and the estimate of pi_0 is at pi_0(lambda=0.95) > >by default in the arguements (lambda = seq(0, 0.95, 0.05)). >In the package siggenes, pi_0 is estimated at pi_0(lambda=1). > >By weighting the observations with 1-lambda assumes values with small >lambda are trusted to be more accurate. > >Details of the weighting of 1-lambda is in the preprint on page 13 of: > >Storey J. D. and Tibshirani R. J. Statistical significance for >genome-wide >experiments. Preprint, January 2003. > >I have run simulations that suggest that including the 1-lambda weights > >can improve the variance in the estimation of pi_0(lambda) over the >entire >range of pi_0. > > >marcus > > > >>> Naomi Altman <naomi@stat.psu.edu> 10/06/2004 9:16:00 AM >>> >I have an Affy experiment with a very high level of differential >expression. It is a one-way ANOVA with 6 treatments, 2 replicates per > >treatment. > >We ran both SAM (excel version) and limma, and had very good agreement > >between them in terms of ranking the genes by the test statistic. For >any >set of the top K genes, over 90% of the genes were identified by both >routines. > >SAM automatically produces a q-values and estimates FDR and pi_0 (the >percentage of non-differentially expressing genes). I used the >Bioconductor package "qvalue" to convert the limma p-values to >q-values. Both routines are supposed to be based on the same paper. >But >the SAM q-value for the most highly differentially expressed gene is >.0039, >whereas the q-value from "qvalue" is 3.9e-12. The SAM q-value for the > >1000th most highly differentially expressed gene is also .0039, but the > >value from "qvalue" is 5.6e-10. > >As well, "qvalue" (at FDR=0.01) is returning genes whose p-values are >pretty big - e.g. p=0.12. Partly this is because the estimated pi_0 is > >just 7%. By contrast, SAM estimates pi_0 to be 17% and returns a much > >smaller list of genes at the same FDR. These genes have unadjusted >p-values which are quite small. > >I guess if I believe SAM, I should be getting about 83% of my genes >declared statistically significant - which, interestingly enough is >about >what I do get at FDR=.01 from "qvalue". > >As always, I welcome the insights of the members of this list. > >--Naomi > > > >Naomi S. Altman 814-865-3791 (voice) >Associate Professor >Bioinformatics Consulting Center >Dept. of Statistics 814-863-7114 (fax) >Penn State University 814-865-1348 >(Statistics) >University Park, PA 16802-2111 > >_______________________________________________ >Bioconductor mailing list >Bioconductor@stat.math.ethz.ch >https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor > >______________________________________________________ > >The contents of this e-mail are privileged and/or confidential to the >named recipient and are not to be used by any other person and/or >organisation. If you have received this e-mail in error, please notify >the sender and delete all material pertaining to this e-mail. >______________________________________________________ Naomi S. Altman 814-865-3791 (voice) Associate Professor Bioinformatics Consulting Center Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111

ADD COMMENT • link 19.9 years ago Naomi Altman ★ 6.0k

Login before adding your answer.