I have an Affy experiment with a very high level of differential
expression. It is a one-way ANOVA with 6 treatments, 2 replicates per
treatment.
We ran both SAM (excel version) and limma, and had very good agreement
between them in terms of ranking the genes by the test statistic. For
any
set of the top K genes, over 90% of the genes were identified by both
routines.
SAM automatically produces a q-values and estimates FDR and pi_0 (the
percentage of non-differentially expressing genes). I used the
Bioconductor package "qvalue" to convert the limma p-values to
q-values. Both routines are supposed to be based on the same paper.
But
the SAM q-value for the most highly differentially expressed gene is
.0039,
whereas the q-value from "qvalue" is 3.9e-12. The SAM q-value for the
1000th most highly differentially expressed gene is also .0039, but
the
value from "qvalue" is 5.6e-10.
As well, "qvalue" (at FDR=0.01) is returning genes whose p-values are
pretty big - e.g. p=0.12. Partly this is because the estimated pi_0
is
just 7%. By contrast, SAM estimates pi_0 to be 17% and returns a much
smaller list of genes at the same FDR. These genes have unadjusted
p-values which are quite small.
I guess if I believe SAM, I should be getting about 83% of my genes
declared statistically significant - which, interestingly enough is
about
what I do get at FDR=.01 from "qvalue".
As always, I welcome the insights of the members of this list.
--Naomi
Naomi S. Altman 814-865-3791 (voice)
Associate Professor
Bioinformatics Consulting Center
Dept. of Statistics 814-863-7114 (fax)
Penn State University 814-865-1348
(Statistics)
University Park, PA 16802-2111
At 07:16 AM 10/06/2004, Naomi Altman wrote:
>I have an Affy experiment with a very high level of differential
>expression. It is a one-way ANOVA with 6 treatments, 2 replicates
per
>treatment.
>
>We ran both SAM (excel version) and limma, and had very good
agreement
>between them in terms of ranking the genes by the test statistic.
For any
>set of the top K genes, over 90% of the genes were identified by both
routines.
As you've found here, the p-values from limma are intended to rank
genes,
and a general guide of significance, rather than to be believed as
absolute
p-values. (I've taken every opportunity to point this out on this
list, in
talks, and in published papers. See for example Section 7 of
http://www.statsci.org/smyth/pubs/mareview.pdf.)
Generally speaking, the p-values returned by limma tend to be too
small.
This is because the conjugate normal model being fitted is not exactly
correct, most notably because the log-intensity measures are almost
invariably more heavy tailed than normal.
To convert ranked p-values into meaningful measures of false discovery
rates is far from trivial. SAM uses a re-sampling approach, which of
course
depends on a particular set of assumptions and approximations. The
approach
which I take in my own analyses is to estimate the false discovery
rate
using known-null contrasts and specially designed control spots. This
approach has not yet been incorporated publicly available software,
but I
hope it will in the future.
I don't fully believe any estimator of false discovery rate that I
have yet
seen, as they don't consistently return believable results on real
data sets.
Best
Gordon
>SAM automatically produces a q-values and estimates FDR and pi_0 (the
>percentage of non-differentially expressing genes). I used the
>Bioconductor package "qvalue" to convert the limma p-values to
>q-values. Both routines are supposed to be based on the same paper.
But
>the SAM q-value for the most highly differentially expressed gene is
>.0039, whereas the q-value from "qvalue" is 3.9e-12. The SAM q-value
for
>the 1000th most highly differentially expressed gene is also .0039,
but
>the value from "qvalue" is 5.6e-10.
>
>As well, "qvalue" (at FDR=0.01) is returning genes whose p-values are
>pretty big - e.g. p=0.12. Partly this is because the estimated pi_0
is
>just 7%. By contrast, SAM estimates pi_0 to be 17% and returns a
much
>smaller list of genes at the same FDR. These genes have unadjusted
>p-values which are quite small.
>
>I guess if I believe SAM, I should be getting about 83% of my genes
>declared statistically significant - which, interestingly enough is
about
>what I do get at FDR=.01 from "qvalue".
>
>As always, I welcome the insights of the members of this list.
>
>--Naomi
Hi,
I can make an observation about the differences in pi_0 estimation
between
the packages siggenes (SAM analysis) and qvalue.
> packageDescription("qvalue", field="Version")
[1] "1.0.3"
> packageDescription("siggenes", field="Version")
[1] "1.0.6"
>
The package siggenes uses the function "p0.est" which has the
following
cubic spline code:
...
spline.out <- smooth.spline(vec.lambda, vec.p0, w = 1 -
vec.lambda,
df = 3)
p0 <- min(predict(spline.out, 1)$y, 1).
...
The package qvalue uses the function "qvalue" which has the following
cubic spline code:
...
spi0 <- smooth.spline(lambda, pi0, df = 3)
pi0 <- predict(spi0, x = max(lambda))$y
...
The estimate pi_0(lambda) over a range of lambda observations becomes
unbiased as lambda -> 1, with a bias variance tradeoff.
The real difference is that observations are not weighted by 1-lambda
in the package qvalue, and the estimate of pi_0 is at
pi_0(lambda=0.95)
by default in the arguements (lambda = seq(0, 0.95, 0.05)).
In the package siggenes, pi_0 is estimated at pi_0(lambda=1).
By weighting the observations with 1-lambda assumes values with small
lambda are trusted to be more accurate.
Details of the weighting of 1-lambda is in the preprint on page 13 of:
Storey J. D. and Tibshirani R. J. Statistical significance for
genome-wide
experiments. Preprint, January 2003.
I have run simulations that suggest that including the 1-lambda
weights
can improve the variance in the estimation of pi_0(lambda) over the
entire
range of pi_0.
marcus
>>> Naomi Altman <naomi@stat.psu.edu> 10/06/2004 9:16:00 AM >>>
I have an Affy experiment with a very high level of differential
expression. It is a one-way ANOVA with 6 treatments, 2 replicates per
treatment.
We ran both SAM (excel version) and limma, and had very good agreement
between them in terms of ranking the genes by the test statistic. For
any
set of the top K genes, over 90% of the genes were identified by both
routines.
SAM automatically produces a q-values and estimates FDR and pi_0 (the
percentage of non-differentially expressing genes). I used the
Bioconductor package "qvalue" to convert the limma p-values to
q-values. Both routines are supposed to be based on the same paper.
But
the SAM q-value for the most highly differentially expressed gene is
.0039,
whereas the q-value from "qvalue" is 3.9e-12. The SAM q-value for the
1000th most highly differentially expressed gene is also .0039, but
the
value from "qvalue" is 5.6e-10.
As well, "qvalue" (at FDR=0.01) is returning genes whose p-values are
pretty big - e.g. p=0.12. Partly this is because the estimated pi_0
is
just 7%. By contrast, SAM estimates pi_0 to be 17% and returns a much
smaller list of genes at the same FDR. These genes have unadjusted
p-values which are quite small.
I guess if I believe SAM, I should be getting about 83% of my genes
declared statistically significant - which, interestingly enough is
about
what I do get at FDR=.01 from "qvalue".
As always, I welcome the insights of the members of this list.
--Naomi
Naomi S. Altman 814-865-3791 (voice)
Associate Professor
Bioinformatics Consulting Center
Dept. of Statistics 814-863-7114 (fax)
Penn State University 814-865-1348
(Statistics)
University Park, PA 16802-2111
_______________________________________________
Bioconductor mailing list
Bioconductor@stat.math.ethz.ch
https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
______________________________________________________
The contents of this e-mail are privileged and/or
confidenti...{{dropped}}
Of course, qvalue uses the p-values generated by another routine (in
my
case, by limma) and SAM generates the q-values from the test
statistic. However, just for fun, I made the following changes to
qvalue:
> spi0 <- smooth.spline(lambda, pi0, df = 3,w=1-lambda)
> pi0 <- predict(spi0, x = 1)$y ...
SAM (excel version) estimates pi0=0.17.
qvalue (vanilla) estimates pi0=0.07.
qvalue (revised) estimates pi0=0.066
and the q-values are almost identical.
So, I would say that the differences between SAM and limma have to do
with
the estimated p-values. While Gordon suggested that the distribution
may
not be the proposed F-distribution, I suspect that the problem is the
level
of differential expression, which is probably over 70% (although
possibly
not as high as 80-90% as suggested by the analysis). It is difficult
to
establish the null distribution from the data when pi0 is very small.
--Naomi
At 12:01 PM 6/10/2004 +1200, Marcus Davy wrote:
>Hi,
>I can make an observation about the differences in pi_0 estimation
>between
>the packages siggenes (SAM analysis) and qvalue.
>
> > packageDescription("qvalue", field="Version")
>[1] "1.0.3"
> > packageDescription("siggenes", field="Version")
>[1] "1.0.6"
> >
>
>The package siggenes uses the function "p0.est" which has the
following
>
>cubic spline code:
>...
> spline.out <- smooth.spline(vec.lambda, vec.p0, w = 1 -
vec.lambda,
>
> df = 3)
> p0 <- min(predict(spline.out, 1)$y, 1).
>...
>The package qvalue uses the function "qvalue" which has the following
>cubic spline code:
>...
> spi0 <- smooth.spline(lambda, pi0, df = 3)
> pi0 <- predict(spi0, x = max(lambda))$y
>...
>
>The estimate pi_0(lambda) over a range of lambda observations becomes
>unbiased as lambda -> 1, with a bias variance tradeoff.
>
>The real difference is that observations are not weighted by 1-lambda
>in the package qvalue, and the estimate of pi_0 is at
pi_0(lambda=0.95)
>
>by default in the arguements (lambda = seq(0, 0.95, 0.05)).
>In the package siggenes, pi_0 is estimated at pi_0(lambda=1).
>
>By weighting the observations with 1-lambda assumes values with small
>lambda are trusted to be more accurate.
>
>Details of the weighting of 1-lambda is in the preprint on page 13
of:
>
>Storey J. D. and Tibshirani R. J. Statistical significance for
>genome-wide
>experiments. Preprint, January 2003.
>
>I have run simulations that suggest that including the 1-lambda
weights
>
>can improve the variance in the estimation of pi_0(lambda) over the
>entire
>range of pi_0.
>
>
>marcus
>
>
> >>> Naomi Altman <naomi@stat.psu.edu> 10/06/2004 9:16:00 AM >>>
>I have an Affy experiment with a very high level of differential
>expression. It is a one-way ANOVA with 6 treatments, 2 replicates
per
>
>treatment.
>
>We ran both SAM (excel version) and limma, and had very good
agreement
>
>between them in terms of ranking the genes by the test statistic.
For
>any
>set of the top K genes, over 90% of the genes were identified by both
>routines.
>
>SAM automatically produces a q-values and estimates FDR and pi_0 (the
>percentage of non-differentially expressing genes). I used the
>Bioconductor package "qvalue" to convert the limma p-values to
>q-values. Both routines are supposed to be based on the same paper.
>But
>the SAM q-value for the most highly differentially expressed gene is
>.0039,
>whereas the q-value from "qvalue" is 3.9e-12. The SAM q-value for
the
>
>1000th most highly differentially expressed gene is also .0039, but
the
>
>value from "qvalue" is 5.6e-10.
>
>As well, "qvalue" (at FDR=0.01) is returning genes whose p-values are
>pretty big - e.g. p=0.12. Partly this is because the estimated pi_0
is
>
>just 7%. By contrast, SAM estimates pi_0 to be 17% and returns a
much
>
>smaller list of genes at the same FDR. These genes have unadjusted
>p-values which are quite small.
>
>I guess if I believe SAM, I should be getting about 83% of my genes
>declared statistically significant - which, interestingly enough is
>about
>what I do get at FDR=.01 from "qvalue".
>
>As always, I welcome the insights of the members of this list.
>
>--Naomi
>
>
>
>Naomi S. Altman 814-865-3791 (voice)
>Associate Professor
>Bioinformatics Consulting Center
>Dept. of Statistics 814-863-7114 (fax)
>Penn State University 814-865-1348
>(Statistics)
>University Park, PA 16802-2111
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor@stat.math.ethz.ch
>https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
>
>______________________________________________________
>
>The contents of this e-mail are privileged and/or confidential to the
>named recipient and are not to be used by any other person and/or
>organisation. If you have received this e-mail in error, please
notify
>the sender and delete all material pertaining to this e-mail.
>______________________________________________________
Naomi S. Altman 814-865-3791 (voice)
Associate Professor
Bioinformatics Consulting Center
Dept. of Statistics 814-863-7114 (fax)
Penn State University 814-865-1348
(Statistics)
University Park, PA 16802-2111