End of the line of GOstats: making sense of the hypergeometric test results now


Massimo Pinto ▴ 390


Greetings all,

Having first searched the GMane archives, I suppose the following question is appropriate. After selecting my 'entrezUniverse', I ran a hypergeometric test, as implemented in the functions provided by GOstats, and obtained a readable, hyperlinked report listing the ontology nodes that appear to be significantly implicated, along with p-values, odds ratios, the number of significantly regulated genes that fall in each listed node, etc.

The report is not exactly short, and I am looking for criteria to guide the interpretation of the results. Specifically, I am trying to hunt for the most 'interesting' implicated ontology nodes and, to this end, a marker may be useful. Assuming this line of thinking is appropriate, and focusing on the first few lines of the report:

    > GO.df.CM3.ctr1.2.3
          GOBPID       Pvalue OddsRatio   ExpCount Count Size Term
    1 GO:0040011 9.322848e-05  2.558205 11.8928490    26  145 locomotion
    2 GO:0002376 2.337660e-04  1.887324 28.2147590    47  344 immune system process
    3 GO:0007165 2.821193e-04  1.541496 82.4297464   110 1005 signal transduction
    4 GO:0006954 2.840421e-04  2.892962  7.3817683    18   90 inflammatory response
    5 GO:0051272 4.985200e-04  6.638731  1.5583733     7   19 positive regulation of cell motion
    6 GO:0007154 5.866973e-04  1.493138 88.4992004   115 1079 cell communication
    [...]

I wonder whether the correct marker for my hunt is the p-value or the odds ratio, which would rank my list differently. In addition, the ontology nodes containing the largest number of genes (Size, above) may be too broad in scope to reveal a biological process that is specifically implicated in my experiment. By the same token, ontology nodes with too few genes may not provide convincing evidence of their implication.

In short, what is the suggested strategy to proceed?

Thank you very much in advance to all of you who read this post.

Yours,
Massimo


Updated 14.4 years ago by James W. MacDonald • written 14.4 years ago by Massimo Pinto


James W. MacDonald 65k


Hi Massimo,

Massimo Pinto wrote:
> Put shortly, what's the suggested strategy to proceed?

The strategy depends on your original hypothesis. If the hypothesis was that inflammation should be a factor in your experimental samples, then you should be looking at #4.

If there wasn't a hypothesis, then I would tend to look at the more directed terms first. Something like locomotion is so general as to be useless. However, positive regulation of cell motion would probably be a more tractable ontology to explore.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician, Douglas Lab
University of Michigan, Department of Human Genetics
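Jim does not say how to decide which terms are "more directed". One rough way to act on the advice, sketched below in plain Python rather than with GOstats itself, is to order the reported rows from the smallest to the largest Size and drop terms supported by very few genes. Treating Size as a proxy for generality is an assumption made here for illustration, and both `most_specific_first` and its `min_count` threshold are hypothetical names, not part of any package.

```python
from typing import NamedTuple


class Row(NamedTuple):
    go_id: str
    pvalue: float
    odds_ratio: float
    exp_count: float
    count: int
    size: int
    term: str


# The six rows shown in the question, copied verbatim from the report.
REPORT = [
    Row("GO:0040011", 9.322848e-05, 2.558205, 11.8928490, 26, 145, "locomotion"),
    Row("GO:0002376", 2.337660e-04, 1.887324, 28.2147590, 47, 344, "immune system process"),
    Row("GO:0007165", 2.821193e-04, 1.541496, 82.4297464, 110, 1005, "signal transduction"),
    Row("GO:0006954", 2.840421e-04, 2.892962, 7.3817683, 18, 90, "inflammatory response"),
    Row("GO:0051272", 4.985200e-04, 6.638731, 1.5583733, 7, 19, "positive regulation of cell motion"),
    Row("GO:0007154", 5.866973e-04, 1.493138, 88.4992004, 115, 1079, "cell communication"),
]


def most_specific_first(rows, min_count=5):
    """Sort terms by ascending Size, keeping only those with Count >= min_count.

    Assumption: a smaller Size is used as a stand-in for a more specific term.
    min_count is a hypothetical guard against terms backed by very few genes.
    """
    kept = [r for r in rows if r.count >= min_count]
    return sorted(kept, key=lambda r: r.size)


for r in most_specific_first(REPORT):
    print(f"{r.size:5d} {r.count:4d}  {r.go_id}  {r.term}")
```

On these six rows the ordering starts with positive regulation of cell motion and inflammatory response, the two terms Jim singles out. Size is only a crude indicator; the position of a term in the ontology graph carries information that this sketch ignores.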

Posted 14.4 years ago by James W. MacDonald


Hi,

two comments:

1) How you interpret the output depends a bit on whether you used conditional=TRUE or FALSE (I don't think you have told us). Which one you use depends on what you are trying to achieve.

2) The odds ratio is the size of the effect (if you are more comfortable with gene expression data, think "fold change"), and the p-value (as always) tells you how unusual that is under the null hypothesis. You should rank your list by whichever is most important to you.

Robert
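To make the distinction between effect size and significance concrete, here is a small sketch in plain Python (not the GOstats code) that computes the sample odds ratio of the 2x2 table, the expected count, and the upper-tail hypergeometric p-value from Count and Size. The thread never states the universe size or the number of selected genes; the values 4377 and 359 used below are back-calculated from the ExpCount/Size and OddsRatio columns of the report and should be treated as an assumption. The function names are hypothetical.

```python
from math import comb

# Assumed values, inferred from the report rather than stated in the thread.
N_UNIVERSE = 4377   # genes in the universe
N_SELECTED = 359    # significantly regulated genes


def odds_ratio(count, size, n_selected=N_SELECTED, n_universe=N_UNIVERSE):
    """Sample odds ratio of the 2x2 table (selected or not) x (in term or not)."""
    a = count                      # selected, in term
    b = n_selected - count         # selected, not in term
    c = size - count               # not selected, in term
    d = n_universe - size - b      # not selected, not in term
    return (a * d) / (b * c)


def expected_count(size, n_selected=N_SELECTED, n_universe=N_UNIVERSE):
    """Expected number of selected genes in the term under the null."""
    return n_selected * size / n_universe


def upper_tail_pvalue(count, size, n_selected=N_SELECTED, n_universe=N_UNIVERSE):
    """P(X >= count) when X is hypergeometric: n_selected draws from the universe."""
    total = comb(n_universe, n_selected)
    top = min(size, n_selected)
    hits = sum(
        comb(size, k) * comb(n_universe - size, n_selected - k)
        for k in range(count, top + 1)
    )
    return hits / total


# (term, Count, Size) for two rows of the report.
for term, count, size in [
    ("locomotion", 26, 145),
    ("positive regulation of cell motion", 7, 19),
]:
    print(
        f"{term}: OR={odds_ratio(count, size):.6f} "
        f"expected={expected_count(size):.4f} "
        f"p={upper_tail_pvalue(count, size):.3e}"
    )
```

With these assumed totals the odds ratios and expected counts agree with the reported columns. The two rows also show why the rankings differ: the small term has the larger odds ratio, but with only 7 genes behind it the evidence against the null is weaker than for locomotion, whose 26 genes give a smaller p-value despite a more modest effect.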

Reply posted 14.4 years ago by rgentleman


Thank you to both.

@Robert: I have run hyperGTest both conditionally and not, and noticed that the effect is rather substantial: the list of significantly implicated nodes does get shorter when conditional=TRUE.

@James: how do you tell a more general node from a less general one? Do you merely count the genes in each, or do you look at other factors, for example from the ontology tree?

Thank you
Massimo

Massimo Pinto
Post Doctoral Research Fellow
Enrico Fermi Centre and Italian Public Health Research Institute (ISS), Rome
http://claimid.com/massimopinto

Reply posted 14.4 years ago by Massimo Pinto
