Hi Sean,
In this situation I would hope it is a one-sided test. I had this
same discussion with a colleague who wanted the same thing. I don't
think testing for under-representation means anything. Think about
the context: one is doing recursive sampling of a finite population
for which there are two sources of bias, what is represented in the
database or on the chip, and what is annotated on the chip. Further,
you are testing at each node the discrepancy from random; as you go
down the DAG, zero becomes more and more probable. You can think of
it as doing a mark-recapture study on your genes. This problem is
exacerbated by the sampling bias. Finally, a last complication is
that the test is further biased by your ability to detect
differentially expressed genes. At least if you detect
over-representation you can argue for a strong signal.
Nicholas
>
> Message: 4
> Date: Wed, 22 Dec 2004 11:02:55 -0500
> From: Sean Davis <sdavis2@mail.nih.gov>
> Subject: [BioC] GoHyperG
> To: Bioconductor <bioconductor@stat.math.ethz.ch>
> Message-ID: <f0ee8e4b-5432-11d9-accb-000d933565e8@mail.nih.gov>
> Content-Type: text/plain; charset=US-ASCII; format=flowed
>
> Just a quick question--are the p-values from gohyperg one- or
> two-sided? I have a collaborator who would like to use it to
> determine underrepresented ontology categories.
>
> Thanks,
> Sean
>
>
>
On Dec 23, 2004, at 10:52 AM, Nicholas Lewin-Koh wrote:
> Hi Sean,
> In this situation I would hope it is a one-sided test. I had this
> same discussion with a colleague who wanted the same thing. I don't
> think testing for under-representation means anything. Think about
> the context: one is doing recursive sampling of a finite population
> for which there are two sources of bias, what is represented in the
> database or on the chip, and what is annotated on the chip. Further,
> you are testing at each node the discrepancy from random; as you go
> down the DAG, zero becomes more and more probable. You can think of
> it as doing a mark-recapture study on your genes. This problem is
> exacerbated by the sampling bias. Finally, a last complication is
> that the test is further biased by your ability to detect
> differentially expressed genes. At least if you detect
> over-representation you can argue for a strong signal.
I'm being a bit dense, but suppose I have 10000 genes on a chip
(annotated in ontology Y), 1000 of which are annotated as category X; I
find 1000 differentially-expressed genes (annotated in ontology Y) from
that chip, but only 12 are from category X. Is that not interesting to
know about?
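For concreteness, those numbers can be plugged into the two one-sided
hypergeometric tails directly. A minimal pure-Python sketch (this is not
the GOHyperG code; the helper names are invented for illustration):

```python
# One-sided hypergeometric tail probabilities for the example in the
# thread: 10000 genes on the chip, 1000 annotated to category X,
# 1000 differentially expressed, only 12 of them in X.
# Pure Python via log-gamma; a sketch, not the GOHyperG implementation.
from math import lgamma, exp

def log_choose(n, k):
    # log of the binomial coefficient C(n, k)
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def hyper_pmf(k, M, K, N):
    # P(X = k): k category hits when drawing N genes from M, K in category
    return exp(log_choose(K, k) + log_choose(M - K, N - k) - log_choose(M, N))

def under_rep_p(k, M, K, N):
    # one-sided p-value for under-representation: P(X <= k)
    return sum(hyper_pmf(i, M, K, N) for i in range(k + 1))

def over_rep_p(k, M, K, N):
    # one-sided p-value for over-representation: P(X >= k)
    return sum(hyper_pmf(i, M, K, N) for i in range(k, min(K, N) + 1))

p_under = under_rep_p(12, 10000, 1000, 1000)  # expected in-category count is 100
```

With an expected count of 100 in-category genes, observing 12 gives a
vanishingly small lower tail, so under the plain hypergeometric model
the example would indeed register as extreme under-representation;
Nicholas's objection is about whether that model's assumptions hold, not
the arithmetic.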
As for finding zeros, as zero becomes more probable as one moves down
the DAG, of course finding "underrepresented" groups becomes
prohibitively difficult, but for large categories it is certainly
possible. As for biases, I'm not sure that I agree that the ability to
detect differentially-expressed genes is a source of "bias". It is
certainly a limitation, but I don't think a bias. And I'm not sure what
"sampling bias" might be present?
Thanks for the food for thought.
Sean
Hi Sean,
My answer is below.
On Thu, 23 Dec 2004 11:29:45 -0500, "Sean Davis" <sdavis2@mail.nih.gov>
said:
> I'm being a bit dense, but suppose I have 10000 genes on a chip
> (annotated in ontology Y), 1000 of which are annotated as category X;
> I find 1000 differentially-expressed genes (annotated in ontology Y)
> from that chip, but only 12 are from category X. Is that not
> interesting to know about?
I'd say probably not. More likely the 12 genes represent
over-representation somewhere down the DAG, or they are due to genes
that overlap categories and are part of another set of groups that is
expressed. If you do detect under-representation, how would you
interpret it? I don't see how there would be a biological
interpretation (mind you, I am not a biologist) unless you had a
distinct hypothesis about a group that should be expressed under the
treatment, in which case this is probably the wrong approach and
something like Jelle Goeman's global test would be much more
appropriate.
>
> As for finding zeros, as zero becomes more probable as one moves down
> the DAG, of course finding "underrepresented" groups becomes
> prohibitively difficult, but for large categories it is certainly
> possible. As for biases, I'm not sure that I agree that the ability
> to detect differentially-expressed genes is a source of "bias". It is
> certainly a limitation, but I don't think a bias. And I'm not sure
> what "sampling bias" might be present?
Look at the parameters in the hypergeometric. The idea behind the
hypergeometric is sampling from a finite population. We have a finite
population N, but N is conditional on the probes being annotated and
represented on the chip. So from that perspective we are conditionally
unbiased. But at each level of refinement in GO we can expect that
annotation will be more variable, so we are "losing" genes as the
functions become more refined. It is like dropping marbles through
leaky pipes and trying to estimate the total by what drops through at
the bottom.
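That leaky-pipes picture is easy to mock up. A toy simulation (the
retention probability and depth are made-up assumptions, nothing
GO-specific):

```python
# Toy version of the "leaky pipes" point: if each gene keeps its
# annotation with some probability at every level of refinement, counts
# at deep nodes shrink geometrically, so a zero at a deep node need not
# reflect biological under-representation. retain/depth are assumptions.
import random

random.seed(0)
genes_at_root = 1000
retain = 0.6      # assumed per-level chance an annotation survives refinement
depth = 5

counts = [genes_at_root]
for _ in range(depth):
    survivors = sum(1 for _ in range(counts[-1]) if random.random() < retain)
    counts.append(survivors)
# counts decays roughly like 1000 * 0.6**level
```

The observed count at the bottom says as much about the per-level
annotation loss as about the underlying category size, which is the
sampling bias at issue.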
Anyway, I'm drinking eggnog as I write, so I may not be making as much
sense as I think I am.
A merry Christmas to you.
Nicholas
>
> Thanks for the food for thought.
>
> Sean
>