On Tuesday 24 February 2004 09:33, Nicholas Lewin-Koh wrote:
> Hi all,
> I have a few questions about testing for over representation of
> a cluster.
> let's consider a simple case, a set of chips from an experiment say
> treated and untreated with 10,000
> genes on the chip and 1000 differentially expressed. Of the 10000, 7000
> can be annotated and 6000 have
> a GO function assigned to them at a suitable level. Say for this
> there are 30 GO classes that appear.
> I then conduct Fisher's exact test 30 times on each GO category to test for
> differential representation of terms in the expressed
> set and correct for multiple testing.
I think I understand your setup. Just to double check, let me rephrase it:
for every one of the 30 GO terms, you set up a 2x2 contingency table (genes
with/without the GO term by genes in class A vs. genes in class B) and carry
out a Fisher's exact test, so you do 30 tests.
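A minimal sketch of one such test follows, in Python using only the standard library. It assumes a one-sided test for over-representation and a Bonferroni correction; the original message does not say which alternative or which correction was used, so both are illustrative choices, and the function names are hypothetical.

```python
from math import comb

def fisher_over_representation(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: class A genes with the GO term     b: class A genes without it
    c: class B genes with the GO term     d: class B genes without it
    Returns P(X >= a) under the hypergeometric null (assumed alternative:
    the term is over-represented in class A).
    """
    n_class_a = a + b
    n_term = a + c
    n_total = a + b + c + d
    denom = comb(n_total, n_class_a)
    upper = min(n_class_a, n_term)
    numer = sum(
        comb(n_term, k) * comb(n_total - n_term, n_class_a - k)
        for k in range(a, upper + 1)
    )
    return numer / denom

def bonferroni(p_values):
    """Bonferroni adjustment (an assumed choice of correction)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Toy table, not data from the thread.
p = fisher_over_representation(3, 1, 1, 3)
print(p)
```

With 30 GO terms, the function would be called once per term and the 30 resulting p-values passed together to `bonferroni`.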
However, I am not sure what you mean by "7000 can be annotated and 6000 have
a GO function assigned to them at a suitable level". Does this mean that, if a
gene has no GO annotation, you will not introduce it into the above 2x2
tables? It could be in the table (so the sum of entries in each of the tables
is 10000); it just goes to the "absent" cells.
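To make the point about the "absent" cells concrete, here is a small Python sketch of how the table for one GO term could be built so that every gene on the chip is counted. The function name and the toy gene identifiers are hypothetical; only the idea (un-annotated genes go to the "without the term" cells) comes from the text above.

```python
def table_for_term(all_genes, de_genes, term_genes):
    """Return (a, b, c, d) for one GO term, counting every gene on the chip.

    Genes with no annotation at all are simply not in term_genes, so they
    land in the "without the term" cells b or d.
    """
    all_genes = set(all_genes)
    de = set(de_genes) & all_genes
    with_term = set(term_genes) & all_genes
    a = len(de & with_term)           # differentially expressed, has term
    b = len(de - with_term)           # differentially expressed, no term
    c = len(with_term - de)           # not differentially expressed, has term
    d = len(all_genes) - a - b - c    # everything else on the chip
    return a, b, c, d

# Toy chip with 10 genes (hypothetical identifiers).
chip = [f"g{i}" for i in range(10)]
cells = table_for_term(chip, ["g0", "g1", "g2"], ["g0", "g1", "g5"])
print(cells, sum(cells))
```

The entries always sum to the number of genes on the chip, which would be 10000 in the setup discussed here.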
> My question is on the validity of this procedure. Just from
> many genes will
> have multiple functions assigned to them so the genes falling into
> classes are not independent.
Yes, sure, though I'd rather reword it as saying that the presence of
a GO term X (e.g., metabolism) is not independent of the presence of
GO term Y (e.g., transport).
However, I don't see this as an inherent problem. Suppose you measure
length, body mass, and height of a bunch of men and women, and carry out
three t-tests. Of course, the three variables are correlated.
Now, you might have used Hotelling's T^2 test for testing the null hypothesis
that the multivariate means (in the space defined by the three traits) of the
two sexes do not differ. But that is a different biological question from
"do they differ in any one of the three traits", which is what you are
asking if you run 3 t-tests. [Some of these issues are discussed very
by W. Krzanowski in "Principles of multivariate analysis", pp. 235
the 1988 edition, and in the categorical variable case by Fienberg,
"The analysis of cross-classified categorical data, 2nd ed", pp. 20-21].
From the above point of view, I think that many of the examples in Westfall &
Young ("Resampling-based multiple testing") could also be reframed in a
multivariate way. But they are not. The reason, I think, is that in
these cases (i.e., FatiGO, Westfall & Young, etc.) the biologists are
interested in fishing in a sea of univariate hypotheses. I think that
the questions that biologists are asking in these cases are often univariate.
A multivariate alternative would be to use a log-linear model of a
contingency table: we have 10000 genes that we cross-classify by
group membership (differentially expressed or not), and by each of the
K = 30 GO terms (with two values for each term: present or absent). So we
have a multidimensional table of 2 x 2^30. This won't work.
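The arithmetic behind that conclusion can be checked in a few lines of Python. This is only a sketch of the cell count; the function name is hypothetical, and the reading that sparsity is what makes the table unusable is an interpretation of the statement above.

```python
def table_sparsity(n_genes, k_terms):
    """Cell count of a 2 x 2^K table and the largest possible fraction of
    non-empty cells when n_genes observations are spread over it."""
    n_cells = 2 * 2 ** k_terms
    max_occupied_fraction = min(n_genes, n_cells) / n_cells
    return n_cells, max_occupied_fraction

n_cells, frac = table_sparsity(10_000, 30)
print(n_cells)  # 2147483648
print(frac)
```

Even if every gene fell into its own cell, fewer than one cell in 200,000 would be non-empty.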
> Also, there is the large set of un-annotated genes so we are in
> ignoring the influence of
> all the unannotated genes on the outcome.
This relates to the more general problem of the quality of GO annotations,
with two related problems:
a) absence of annotation does not necessarily mean absence of that GO
function, but maybe just that that particular aspect has never been studied
for that gene;
b) presence of an annotation does not mean that the gene really has that
function, since there are mistakes in the annotation; in fact, GO has a number
of levels for "quality of annotations" (see
It is my understanding that most tools, right now, just ignore these problems. I
am not sure how serious the consequences are, but so far at least our
experience seems to be that results make sense (e.g., see our examples
Of course, this is no excuse. A possible way would be to explicitly model what
presence and absence of annotation mean, probably making use of the
information contained in the "quality of annotations", within a
framework. M. Battacharjee and I have been working on it (but, because of my
delays, this is becoming a never-ending project).
> opinions on these approaches? It is
> appearing all over the place in bioinformatics tools like FATIGO,
> DAVID etc. I find that
Yes, several people have had similar ideas. And I think there are a number of
similar tools around.
> the formal testing approach makes me very uncomfortable, especially
> the biologists I work with tend to over interpret the results.
I don't see your last point: how the formal testing leads to
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
PGP KeyID: 0xE89B3462