Can I use FDR correction with the hyperGTest conditional GO method?
Mark W Kimpel (@mark-w-kimpel-2027):
Here's a question for the serious statisticians amongst us.

The function hyperGTest of package "GOstats" implements a method similar to the "elim" method of Alexa et al. (2006). Alexa et al. claim that the oft-used hypergeometric test on the entire ontology cannot be analyzed for FDR because of the highly interdependent nature of the DAG structure of GO. The authors go on to claim that their methods decrease this interdependence but, as far as I can tell, never directly answer the question of whether the resulting p-values can be corrected for FDR.

For the purpose of the following discussion, assume that we are working with only one of the three major GO categories. While it is true that dependence has been decreased because a parent cannot reverse-inherit a gene from its child, several children at the same level can share genes, or can they? I'm not sure.

If there is gene overlap at the lowest levels of the GO graph structure, then it seems to me that there is still dependence and FDR cannot be assessed. Correct?

If there is no gene overlap at the lowest levels of the GO graph structure, then it seems to me that these levels are independent and FDR can be applied. Correct?

Would someone who really knows GO answer the question about overlap of genes at the lowest levels, and could a statistician then answer the questions regarding dependence/independence and the applicability of an FDR method such as BH or the Storey q-value?

Thanks,
Mark

--
Mark W. Kimpel MD
Neuroinformatics, Department of Psychiatry
Indiana University School of Medicine
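For concreteness, here is a minimal sketch (not part of the original post) of the workflow being asked about: run the conditional test, then hand the raw p-values to BH and to Storey's q-value. The names selectedIds and universeIds and the annotation package "hgu95av2.db" are placeholders for your own Entrez ID vectors and platform.

library(GOstats)   # hyperGTest() and the GOHyperGParams class
library(qvalue)    # Storey q-values

params <- new("GOHyperGParams",
              geneIds         = selectedIds,    # placeholder: Entrez IDs of the "interesting" genes
              universeGeneIds = universeIds,    # placeholder: Entrez IDs of all genes measured
              annotation      = "hgu95av2.db",  # placeholder: your chip's annotation package
              ontology        = "BP",           # one of the three major GO categories
              pvalueCutoff    = 1,              # keep every term; adjustment is done afterwards
              conditional     = TRUE,           # the conditional (elim-like) analysis in question
              testDirection   = "over")

condRes <- hyperGTest(params)
p       <- pvalues(condRes)                     # raw p-values, one per GO BP term tested

padjBH  <- p.adjust(p, method = "BH")           # Benjamini-Hochberg adjustment
qvals   <- qvalue(p)$qvalues                    # Storey q-values; whether either is valid under
                                                # the residual dependence is exactly the question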
rgentleman (@rgentleman-7725):
Hi Mark,

There has been a fair amount of discussion of these issues already; searching the mailing list will help to reveal the salient points.

The most important question here is: what do *you* think p-value correction is going to do for you? In my opinion (and lots of folks seem to have different views), p-value corrections do two things for us. Both are related to the observation that, under the complete null (all null hypotheses are true), the smallest p-value when testing 10K hypotheses is much smaller than the smallest p-value when testing 5K, and most of us need some help deciding/interpreting these outputs.

1) If you test some large number of hypotheses, p-value corrections allow you to interpret the p-values in some holistic way. Here one is trying to answer the question of whether any (or how many) of the null hypotheses are truly false. And it sort of works, but basically the "correction" is almost always a reduction in the significance level, and so not only do you enrich the set of "called false" hypotheses for truly false ones, you also make more errors of the other kind (failing to reject null hypotheses that are false).

2) If you have two experiments, one with 5K hypotheses and one with 10K, then p-value corrections allow you to "align" the evidence and to compare the two experiments in some sensible way.

I am not aware of any other contributions that these methods can make, but perhaps others will enlighten us.

Now, when we turn our attention to GO, the problem is not one of p-value correction but one of philosophy, again in my view. Consider the following situation (which does often arise). Consider two nodes in the GO graph with a parent-child relationship, and further consider a given set of data, where you have some set of tested genes (which define your universe) and some set of genes you have decided are *special*. Next we find that, for these two nodes in the graph, the same set of genes is annotated at both (for all genes in the organism this will not be true, but we didn't measure them all and we only get to work with what we measured). So now the two p-values from your hypergeometric test are identical. No amount of p-value correction (or even p-value psychotherapy) will change that. So which node do you report? This is entirely philosophy and not mathematics. Current scientific practice is to report the more specific of the nodes, and to make the more general claim (e.g., "I cured cancer" rather than "I cured person X, who had cancer") only when there is additional evidence, over and above that needed for the specific claim. That is the point of the conditional analyses.

Now of course, a much better way to do the whole thing is to use GSEA (e.g., the Category package), but then you will eventually end up back at the same place. When you are dealing with dependent hypotheses, there is always going to be a philosophical, not just a mathematical, issue to deal with.

best wishes
Robert

--
Robert Gentleman, PhD
Program in Computational Biology, Fred Hutchinson Cancer Research Center
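To make the conditional-versus-plain distinction concrete, here is a minimal sketch (not from the original reply) that runs hyperGTest both ways on the same inputs and compares the reported terms. It assumes the params object from the sketch after the question above; GOBPID is the term-ID column summary() returns for the BP ontology, and the 0.01 cutoff is a placeholder.

conditional(params) <- FALSE                 # ordinary hypergeometric test on every term
plainRes <- hyperGTest(params)

conditional(params) <- TRUE                  # condition each parent on its significant children
condRes  <- hyperGTest(params)

plainTerms <- summary(plainRes, pvalue = 0.01)$GOBPID
condTerms  <- summary(condRes,  pvalue = 0.01)$GOBPID

# Parents whose apparent significance was driven entirely by their
# significant children tend to show up here:
setdiff(plainTerms, condTerms)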
Mark W Kimpel (@mark-w-kimpel-2027):
Robert,

I understand exactly what you are saying, and I think I have a reasonable understanding of what FDR correction does, at least in the "perfect" statistical situation where no tests are dependent on one another and everything is normally distributed. That's easy, at least statistically. But, of course, that is not the reality we deal with in the statistics of microarrays. We know for certain that the expression of many genes is dependent on the expression of others; heck, that's how the whole thing works! Most of the time, however, we don't know which genes are dependent on which others; that is often the point of our experiments: for example, I stimulate this receptor with this drug and see what happens. Given this, there seems to be a consensus that some FDR correction is better than none, so we can select from a number of methods depending on whether we want to err on the side of caution or want to lessen our FNR at the cost of possibly raising the true FDR (which we can never know for sure).

Given that, it is clear that GO analysis as previously implemented by such programs as DAVID, EASE, and GOstats had far too much dependency built into the parent-child relationships to accommodate FDR correction. Alexa et al. (2006) seem to contend that their conditional method "decorrelates" the graph structure of GO, and this led me to wonder whether we could now apply an FDR correction that would be acceptable for publication. There is certainly still some duplication and dependence, but it has been substantially addressed by the conditional method that has been incorporated into hyperGTest of package "GOstats". I feel it is better than nothing, but I am not a professional statistician and would like some guidance, and to see whether there is a consensus, before I proceed with an analysis for publication.

As an aside, I have written a somewhat time-consuming function that iterates over gradually decreasing p-value cutoffs, calculates the FDR for each iteration using the q-value method of Storey, and returns the cutoff that keeps the FDR below a predetermined amount while still maximizing the number of categories returned. I don't think that is cheating; to me it is the same logic that is applied to setting parameters in SAM of package siggenes.

Interesting discussion, and one that I hope will provide for some continued scholarly debate.

Mark
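Mark's cutoff-scanning function itself was not posted; the following is a hedged reimplementation of the idea as he describes it: step through decreasing p-value cutoffs, estimate the FDR of each candidate set with Storey's pi0, and keep the largest set whose estimated FDR stays below a target. The 0.05 target is a placeholder, and p is the vector of raw p-values returned by pvalues() on a hyperGTest result.

library(qvalue)

pickCutoff <- function(p, fdrTarget = 0.05) {
  pi0 <- qvalue(p)$pi0                              # estimated proportion of true null categories
  m   <- length(p)
  for (pc in sort(unique(p), decreasing = TRUE)) {  # most lenient cutoff first
    nSel   <- sum(p <= pc)
    fdrHat <- pi0 * pc * m / nSel                   # Storey-style plug-in FDR estimate for this set
    if (fdrHat <= fdrTarget)                        # first acceptable cutoff is the largest,
      return(list(cutoff = pc,                      # hence it maximizes the number of categories
                  nCategories = nSel,
                  estFDR = fdrHat))
  }
  NULL                                              # no cutoff meets the target
}

# e.g. pickCutoff(pvalues(condRes), fdrTarget = 0.05)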
Kevin R Coombes (@kevin-r-coombes-1589):
Hi,

Take a look at:

Gold DL, Coombes KR, Wang J, Mallick B. Enrichment analysis in high-throughput genomics--accounting for dependency in the NULL. Brief Bioinform. 2006 Oct 31 [Epub ahead of print]

when it comes out. (An earlier version is available as a tech report on our web site at http://bioinformatics.mdanderson.org.)

We have looked at how the model needs to be changed to account for pairwise interactions/correlations between categories. One key point is that the relative rankings of the importance of the different GO categories do not appear to change very much if you improve the model by accounting for dependence. This does not directly address the FDR question you raised, but it suggests that dependence is actually weaker than you might think, so the usual FDR assessments might be close to correct.

Kevin
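One informal way to probe Kevin's point, that the dependence among the reported categories may be weaker than feared, is to look at how much the selected genes annotated at the significant terms actually overlap in your own result. A sketch, assuming the condRes object from the earlier sketches; geneIdsByCategory() is the Category-package accessor for the selected genes annotated at each tested term.

sig   <- summary(condRes, pvalue = 0.01)$GOBPID
glist <- geneIdsByCategory(condRes)[sig]     # selected genes annotated at each significant term

jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
ov <- outer(seq_along(glist), seq_along(glist),
            Vectorize(function(i, j) jaccard(glist[[i]], glist[[j]])))
dimnames(ov) <- list(sig, sig)

# Off-diagonal overlaps near zero suggest weak dependence among the
# reported categories; large values flag the kind of parent/child
# redundancy discussed earlier in the thread.
summary(ov[upper.tri(ov)])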