understanding GOstats p-value
Janet Young • @janet-young-2360
Fred Hutchinson Cancer Research Center
Hi,

I have a fairly naive question - I want to make sure I can more or less understand the p-values that GOstats hyperGTest comes out with. Am I right in thinking the p-value is for enrichment of each category individually (i.e. NOT corrected for multiple testing)?

I'm analyzing array CGH data, so I am testing a lot of categories (my universe is all human genes that have a chromosome position, a GO category, and an Entrez ID). Below is an example result. My interpretation is that I shouldn't get super-excited about finding 3 categories with p < 0.001 if I've tested 2261 categories (I would expect about 2 false positives by chance). Have I understood that correctly?

    > hgCondOver
    Gene to GO BP Conditional test for over-representation
    2261 GO BP ids tested (3 have p < 0.001)
    Selected gene set size: 1433
    Gene universe size: 12325
    Annotation package: org.Hs.eg.db

    > summary(hgCondOver)
                   GOBPID       Pvalue OddsRatio  ExpCount Count Size
    GO:0007156 GO:0007156 0.0001330755  2.470839 12.905720    27  111
    GO:0001894 GO:0001894 0.0007587546  5.553301  2.209087     8   19
    GO:0007600 GO:0007600 0.0009353695  1.446591 74.062556   100  637
                                   Term
    GO:0007156 homophilic cell adhesion
    GO:0001894       tissue homeostasis
    GO:0007600       sensory perception

thanks very much,

Janet Young

Dr. Janet Young (Trask lab)
Fred Hutchinson Cancer Research Center
1100 Fairview Avenue N., C3-168, P.O. Box 19024, Seattle, WA 98109-1024, USA
tel: (206) 667 1471 / fax: (206) 667 6524
email: jayoung at fhcrc.org
http://www.fhcrc.org/labs/trask/
Annotation GO Cancer CGH GOstats
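As a side note on the arithmetic in the question, a minimal sketch, assuming hgCondOver is the hyperGTest result shown above (pvalues() is the per-term p-value accessor from GOstats; the dependence caveats raised in the answers below still apply):

    ## Expected number of false positives at p < 0.001 if the 2261 tests
    ## behaved independently:
    2261 * 0.001   # ~2.3, so ~2 of the 3 hits could plausibly be chance

    ## A conventional (if imperfect, given the dependence discussed below)
    ## multiple-testing adjustment of the raw per-term p-values:
    rawP <- pvalues(hgCondOver)            # named vector, one p per GO term
    adjP <- p.adjust(rawP, method = "BH")  # Benjamini-Hochberg FDR
    head(sort(adjP))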
James W. MacDonald • @james-w-macdonald-5106
Hi Janet,

Interpreting p-values for the hypergeometric test is not straightforward. One of the underlying assumptions of the hypergeometric is that the individual things being chosen are independent (think balls in an urn). Unfortunately, this is not true of genes or GO terms.

There are at least two types of dependence here. First, the expression of genes is not independent -- one gene can affect the expression of another. Second, the GO terms are set up as a directed acyclic graph, with child terms being subsets of the parent terms, so there is another level of dependence. You can use the conditional test to help limit this second level of dependence, but there isn't much you can do about the first.

Because of this unknown dependence structure, it is difficult to do any multiple testing correction for the hypergeometric for a single comparison, not to mention multiple comparisons. One thing I have done in the past for a single comparison is a Monte Carlo resampling in which you randomly select n 'differentially expressed' genes (where n is the number of differentially expressed genes you actually observed) and then see how many significant GO terms you get. Do this, say, 500 or 1000 times, and you will know how many terms you expect to see by chance alone, which gives you an estimate of the number of false positives in your observed results (a sketch of this is below). Unfortunately, it is very time consuming, and I'm not sure it would scale to multiple comparisons.
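A minimal sketch of that resampling, assuming universe is the character vector of Entrez IDs used as the gene universe and nSel is the observed selected-set size (1433 in Janet's example); the variable names are illustrative:

    library(GOstats)

    nullCounts <- replicate(500, {
      fakeSel <- sample(universe, nSel)   # a random 'selected' gene set
      params <- new("GOHyperGParams",
                    geneIds         = fakeSel,
                    universeGeneIds = universe,
                    annotation      = "org.Hs.eg.db",
                    ontology        = "BP",
                    pvalueCutoff    = 0.001,
                    conditional     = TRUE,
                    testDirection   = "over")
      sum(pvalues(hyperGTest(params)) < 0.001)  # significant terms this round
    })

    mean(nullCounts)  # terms expected by chance alone at p < 0.001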
However, if you just have a small number of significant terms, it shouldn't be too difficult to do downstream validation of that result.

Best,
Jim

--
James W. MacDonald, MS
Biostatistician
UMCCC cDNA and Affymetrix Core
University of Michigan
1500 E Medical Center Drive, 7410 CCGC, Ann Arbor MI 48109
734-647-5623
And I should note that this resampling approach _still_ doesn't take the inter-gene dependence into account.

-- Jim
All this is certainly true. However, it is not clear that the dependence makes any real qualitative difference in the results you get. See, for example:

Gold et al., "Enrichment analysis in high-throughput genomics - accounting for dependency in the NULL", Brief Bioinform 2007; 8:71-77,

where we explicitly worked out the implications (for the distribution) of the dependence between pairs of GO categories and checked some actual data sets to see how much things changed.

kevin
Thanks James and Kevin - that has made things clearer for me.

We are also dealing with a third kind of non-independence in our data: array CGH analysis detects large genomic regions of change, and genes of similar function (e.g. large gene families like the olfactory receptors) can be clustered in the genome. Because of this, we'd planned to do something similar to your resampling suggestion - simulate multiple sets of genomic regions with the same size distribution as the real data, determine their gene content, and run the GOstats analysis on each of the simulated sets (a sketch of the idea is below). From what you say, this seems a reasonable approach, although, as you point out, it's time-consuming - I'm already running into problems with how long it takes, so I may try distributing the runs over multiple Linux cluster nodes if I can make that happen relatively easily.
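A minimal sketch of that region-resampling scheme, under illustrative assumptions: chromLengths is a named vector of chromosome lengths, obsWidths holds the widths of the observed aCGH regions, and geneRanges is a data.frame with chrom, start, end, and entrez columns. The runs are independent, so they distribute naturally over cores or nodes (shown here with the parallel package):

    library(parallel)

    ## Draw one simulated set of regions matching the observed size
    ## distribution, and return the Entrez IDs of the genes they cover
    ## (assumes each observed width fits its sampled chromosome).
    simulateOneSet <- function(chromLengths, obsWidths, geneRanges) {
      chroms <- sample(names(chromLengths), length(obsWidths), replace = TRUE,
                       prob = chromLengths / sum(chromLengths))
      starts <- floor(runif(length(obsWidths)) * (chromLengths[chroms] - obsWidths))
      ends   <- starts + obsWidths
      hits <- mapply(function(ch, s, e)
                       with(geneRanges, entrez[chrom == ch & start < e & end > s]),
                     chroms, starts, ends, SIMPLIFY = FALSE)
      unique(unlist(hits))
    }

    ## 500 simulated gene sets spread over 8 cores; each set would then go
    ## through the same GOHyperGParams/hyperGTest call as the real data.
    nullGeneSets <- mclapply(seq_len(500), function(i)
      simulateOneSet(chromLengths, obsWidths, geneRanges), mc.cores = 8)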
Charles Berry • @charles-berry-5754
Janet,

In addition to considering James' replies, you may find the following article (and those cited therein) helpful to your understanding of this and related issues.

Chuck

@article{goeman2007age,
  title     = {Analyzing gene expression data in terms of gene sets: methodological issues},
  author    = {Goeman, J. J. and B{\"u}hlmann, P.},
  journal   = {Bioinformatics},
  volume    = {23},
  number    = {8},
  pages     = {980--987},
  year      = {2007},
  publisher = {Oxford Univ Press}
}