Hypergeometric Testing questions


Javier Pérez Florido ▴ 840


Dear list,

I'm running a hypergeometric test with hyperGTest from the GOstats and Category packages, and I have several questions:

* What is the usual cutoff value used as input for the hypergeometric test for each gene set collection: GO BP, GO MF, GO CC, Chromosome Bands, KEGG and PFAM?
* In the nonspecific filtering step, I suppose that the filter depends on the gene set collection used. For example, using the nsFilter function:
    o For GO BP:
      nsFilter(OligoEset, require.entrez=TRUE, require.GOBP=TRUE, remove.dupEntrez=TRUE, var.func=IQR, var.cutoff=varCutoff, filterByQuantile=TRUE, feature.exclude="^AFFX")
    o For GO MF:
      nsFilter(OligoEset, require.entrez=TRUE, require.GOMF=TRUE, remove.dupEntrez=TRUE, var.func=IQR, var.cutoff=varCutoff, filterByQuantile=TRUE, feature.exclude="^AFFX")
    o For GO CC:
      nsFilter(OligoEset, require.entrez=TRUE, require.GOCC=TRUE, remove.dupEntrez=TRUE, var.func=IQR, var.cutoff=varCutoff, filterByQuantile=TRUE, feature.exclude="^AFFX")
    o For Chromosome Bands:
      nsFilter(OligoEset, require.entrez=TRUE, require.CytoBand=TRUE, remove.dupEntrez=TRUE, var.func=IQR, var.cutoff=varCutoff, filterByQuantile=TRUE, feature.exclude="^AFFX")
    o For KEGG:
      nsFilter(OligoEset, require.entrez=TRUE, remove.dupEntrez=TRUE, var.func=IQR, var.cutoff=varCutoff, filterByQuantile=TRUE, feature.exclude="^AFFX")
  Therefore, depending on the gene set collection, the filter changes.
* Once the hypergeometric test is done, I don't understand some of the fields of the HyperGResult object. What I understood is:
    o ExpCount: the expected number of genes in the selected gene list to be found at each tested category term.
    o Count: for each category term tested, the number of genes from the interesting gene list that are annotated at the term.
    o Size: for each category term tested, the number of genes from the universe gene list that are annotated at the term.
    o OddsRatio: the odds ratio for each category term tested.

If the test is done for over-represented terms, Count is greater than ExpCount. Otherwise, the test has been performed for under-represented terms. I don't understand the meaning of ExpCount. Expected by whom? Should there be a large difference between ExpCount and Count? Is there a relationship between ExpCount, Count and the p-values? I would like to better understand the HyperGResult object in terms of these fields: ExpCount, Count, Size and OddsRatio.

Thanks in advance,
Javier

GO GOstats Category

updated 15.0 years ago by Seth Falcon • written 15.0 years ago by Javier Pérez Florido


Seth Falcon ★ 7.4k


On 12/9/09 10:25 AM, Javier Pérez Florido wrote:
> Dear list,
> I'm using an Hypergeometric Test using hyperGTest from GOstats and
> Category packages. I have several questions related to this issue:
>
> * What is the usual cutoff value used as an input for the
>   hypergeometric test according to the gene set collection used: GO
>   BP, GO MF, GO CC, Chromosome Bands, KEGG and PFAM?

The cutoff value is used to determine significance for a conditional test. For the non-conditional test, the cutoff is only used as a default value in displaying summary results.

What you should choose is up to you. If it helps, common values are 0.05 and 0.01.

> * In the nonspecific filtering, I suppose that one can perform
>   different kind of filters depending on the gene set collection
>   used. For example, using the nsFilter function:
>     o For GO BP: nsFilter(OligoEset,
>       require.entrez=TRUE,require.GOBP=TRUE,
>       remove.dupEntrez=TRUE,
>       var.func=IQR,var.cutoff=varCutoff,filterByQuantile=TRUE,
>       feature.exclude="^AFFX")
>     o For GO MF: nsFilter(OligoEset,
>       require.entrez=TRUE,require.GOMF=TRUE,
>       remove.dupEntrez=TRUE,
>       var.func=IQR,var.cutoff=varCutoff,filterByQuantile=TRUE,
>       feature.exclude="^AFFX")
>     o For GO CC: nsFilter(OligoEset,
>       require.entrez=TRUE,require.GOCC=TRUE,
>       remove.dupEntrez=TRUE,
>       var.func=IQR,var.cutoff=varCutoff,filterByQuantile=TRUE,
>       feature.exclude="^AFFX")
>     o For Chromosome Bands: nsFilter(OligoEset,
>       require.entrez=TRUE,require.CytoBand=TRUE,
>       remove.dupEntrez=TRUE,
>       var.func=IQR,var.cutoff=varCutoff,filterByQuantile=TRUE,
>       feature.exclude="^AFFX")
>     o For KEGG: nsFilter(OligoEset, require.entrez=TRUE,
>       remove.dupEntrez=TRUE,
>       var.func=IQR,var.cutoff=varCutoff,filterByQuantile=TRUE,
>       feature.exclude="^AFFX")
>
> Therefore, depending on the gene set collection, the filter changes.

Yes.

> * Once the Hypergeometric Test is done, I don't understand some of
>   the fields of the HyperGResult object. What I understood is:
>     o ExpCount: the expected number of genes in the selected gene
>       list to be found at each tested category term.
>     o Count: for each category term tested, the number of genes
>       from the interesting gene list that are annotated at the term.
>     o Size: for each category term tested, the number of genes
>       from the universe gene list that are annotated at the term.
>     o OddsRatio: the odds ratio for each category term tested
>
> If the test is done for over-represented terms, Count is greater
> than ExpCount. Otherwise, the test has been performed for
> under-represented terms. I don't understand the meaning of ExpCount.
> Expected by who? Is it expected a great difference between ExpCount
> and Count? Is there a relationship between ExpCount, Count and the
> p-values? I would like to understand better the meaning of the
> HyperGResult object according to these fields: ExpCount, Count, Size
> and OddsRatio.

You might find reading the source code in package Category file R/hyperGTest-methods.R to be helpful.

For a given GO ID, the test proceeds by considering an urn containing the genes in the gene universe. Genes that are annotated at our GO ID are white balls in the urn and the rest of the genes are black balls in the urn. We will draw balls from the urn according to the number of genes in the selected gene list. This leads to a 2x2 table like:

                inGO     notGO
                white    black
    selected    n11      n12
    not         n21      n22

The expected value for n11 is:

    (n11 + n12) * (n11 + n21) / (n11 + n12 + n21 + n22)

If you want more details, take a look at the source code in Category.

+ seth

--
Seth Falcon
Program in Computational Biology | Fred Hutchinson Cancer Research Center
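The urn description above is the setup of the hypergeometric distribution. As a sketch in the notation of the 2x2 table (the symbols N, K, n and X are introduced here only for illustration and are not part of the original post), with X the number of white balls among those drawn:

```latex
\[
\begin{aligned}
N &= n_{11} + n_{12} + n_{21} + n_{22} && \text{(genes in the universe)} \\
K &= n_{11} + n_{21} && \text{(white balls: genes annotated at the GO ID)} \\
n &= n_{11} + n_{12} && \text{(balls drawn: genes in the selected list)} \\[4pt]
P(X = k) &= \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}} \\[4pt]
E[X] &= \frac{nK}{N}
      = \frac{(n_{11} + n_{12})(n_{11} + n_{21})}{n_{11} + n_{12} + n_{21} + n_{22}} \\[4pt]
p_{\text{over}} &= P(X \ge n_{11}) = \sum_{k = n_{11}}^{\min(n,\,K)} P(X = k)
\end{aligned}
\]
```

Read this way, Count is the observed n11, Size is K, and ExpCount is E[X], the value of n11 one would see on average if the selected genes were drawn at random from the universe. The p-value depends on how far n11 lies in the tail of this distribution, so the gap between Count and ExpCount matters only relative to the sizes of the universe, the term and the selected list.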

15.0 years ago Seth Falcon


Thanks Seth, but looking at the code I'm a little bit confused. After checking the help pages, I would explain the meaning of some fields as follows:

- ExpCount: the expected number of genes in the selected gene list to be found at each tested category
- Count: how many instances of that term were actually observed in the gene list
- Size: the number that could have been found in the gene list if every instance had turned up.

When we are testing for over-representation, Count is greater than ExpCount. What I don't see is why it is important to measure the expected count. Another question is the relationship between ExpCount and Count. Does the difference have to be small or large for a term to be interesting? As for the Size field, it is the number of genes that could have been found in the interesting gene list if every instance were present. Present where?

Thanks again, and I apologize for these questions, but it is quite difficult for me to understand the meaning of these fields by looking at the code.

Javier

15.0 years ago Javier Pérez Florido


Hi Javier,

Here's how I think about it; maybe it will help you. Say your background has 10,000 genes, of which 200 (Size) annotate to a particular GO term. If you have 500 genes in your significant list, you would expect to have 200/10,000 = X/500, or X = 10 (ExpCount), genes with that GO term if they were randomly sampled. However, in your list of 500 genes, 25 (Count) have that GO term. Therefore, over-representation testing is a sampling probability problem: if you sample 500 genes out of 10,000, of which 200 are term Y, is getting 25 of them more than you would expect due to chance alone?

HTH,
Jenny

Jenny Drnevich, Ph.D.
Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign
330 ERML, 1201 W. Gregory Dr., Urbana, IL 61801 USA
ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at illinois.edu
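The numbers in this example can be checked with base R alone. The following is a minimal sketch, not the code that hyperGTest runs internally: the variable names are made up for illustration, and the odds ratio shown is the ordinary sample odds ratio of the 2x2 table, which is assumed (not confirmed here) to correspond to the OddsRatio field.

```r
# Numbers from the worked example (variable names are illustrative only).
universe_size <- 10000  # genes in the background / gene universe
term_size     <- 200    # genes annotated to the term (Size)
selected_size <- 500    # genes in the significant list
count         <- 25     # selected genes annotated to the term (Count)

# Expected count if the selected genes were a random sample (ExpCount).
exp_count <- selected_size * term_size / universe_size

# One-sided p-value for over-representation: P(X >= count).
# phyper() returns P(X <= q), so pass count - 1 and ask for the upper tail.
p_over <- phyper(count - 1,
                 term_size,                  # white balls in the urn
                 universe_size - term_size,  # black balls in the urn
                 selected_size,              # balls drawn
                 lower.tail = FALSE)

# The same test expressed on the 2x2 table via one-sided Fisher's exact test.
tab <- matrix(c(count,             selected_size - count,
                term_size - count, universe_size - term_size - (selected_size - count)),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("selected", "not"), c("inGO", "notGO")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Sample odds ratio of the table (assumed analogue of OddsRatio).
odds_ratio <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

cat("ExpCount:", exp_count, "\n")
cat("p-value (phyper):", p_over, "\n")
cat("p-value (fisher.test):", p_fisher, "\n")
cat("odds ratio:", odds_ratio, "\n")
```

Here exp_count is 10, matching the hand calculation, and the two p-values agree because the one-sided Fisher test on this table is the same hypergeometric tail probability. With 25 observed against 10 expected, the p-value falls far below the common cutoffs of 0.05 and 0.01.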

15.0 years ago Jenny Drnevich
