GOstats, geneCounts and gene universe filtering...


Jesper Ryge ▴ 110

@jesper-ryge-1960


Hi,

I'm trying to perform an enrichment analysis for GO terms on my microarray results. My problem arose when I noticed that geneCounts(x) doesn't match the number of genes annotated at certain nodes according to geneIdsByCategory(x). Maybe that's OK; I just wondered whether it really is, or whether I missed something. I thought geneCounts was the number of interesting genes (from the list fed to geneIds) that belong to a particular GO term, and that geneIdsByCategory should list those genes, i.e. the numbers should match. This turned out not to be the case for at least two of the GO nodes in the list of significantly over-represented GO terms:

    > length(geneIdsByCategory(test)[["GO:0051179"]])
    [1] 89
    > geneCounts(test)["GO:0051179"]
    GO:0051179
            20
    > length(geneIdsByCategory(test)[["GO:0007409"]])
    [1] 13
    > geneCounts(test)["GO:0007409"]
    GO:0007409
             6

test is the output from hyperGTest(params), a conditional test for over-representation on the rat2302 chip.

As I said, I might have missed something, but it puzzles me somewhat. Comments welcome :-)

As a "bonus" question, I was wondering if there is any consensus regarding filtering the gene universe before doing the GO enrichment analysis. I know it's recommended in the GOstats manual, for instance by removing probe sets with little variation across samples using IQR (or some similar measure). But in the topGO package by Adrian Alexa they seem to care little about this issue and use all GO-annotated probe sets from the chip used in the particular study.

I was wondering: if you reduce the set of genes in the gene universe (n.GU), don't you also reduce the number of genes annotated to each GO term (n.GA) and most likely the number of interesting genes (n.GI)? At least in my case, some of the genes that are filtered out by IQR were classified as significantly differentially expressed by cyberT or limma on the full data set. So what I'm asking here is: don't n.GI and n.GA depend on, and change as a function of, n.GU? At least when you use coarse-grained filtering methods it seems that this is the case, and you might lose some interesting genes and in effect throw out the baby with the bathwater, so to speak.

Put in (yet) another way: the chance at GO node X of getting n.GI[X] interesting genes out of all the annotated genes n.GA[X] at that node, when sampling n.GI genes from n.GU at random, tells you something about the chance of enrichment at node X. I hope I got that part right? But if n.GI and n.GA depend on n.GU, this chance of enrichment might not change drastically when you reduce the gene universe with some coarse-grained variance method? Or?

My preliminary test of filtering versus no filtering seems to show that there is rather little effect; most of the GO terms are identical in both cases. Does that mean I should put more trust in the terms that come up in both lists, based on the filtered and the unfiltered gene universe? Or should I prefer one list over the other for some particular reason? It seems to me that the GO terms that are more robust to changes in the gene universe are the most likely candidates.

Hm, I realise this became a little long. I hope I explained it in a way that makes sense. Sorry if I'm raising an already discussed issue, but I couldn't find any previous discussions on this. Advice and pointers most appreciated :-)

Cheers,
Jesper Ryge
PhD Student, Department of Neuroscience
Karolinska Institutet
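The dependence of n.GA and n.GI on n.GU asked about above can be illustrated with a small base R sketch. Everything in it is made up for illustration: the gene names, the random "variability" scores standing in for a per-probe-set IQR, the 0.5 cutoff and the helper function `counts()` are assumptions, not part of GOstats or topGO.

```r
# Toy data only: 1000 hypothetical probe sets with a made-up variability score
set.seed(1)
universe  <- paste0("g", 1:1000)
iqr       <- setNames(runif(1000), universe)  # stand-in for per-probe-set IQR
annotated <- sample(universe, 100)            # genes annotated at GO node X
selected  <- sample(universe, 50)             # genes called differentially expressed

# Hypothetical helper: restrict everything to the current universe, then count
counts <- function(universe, annotated, selected) {
  annotated <- intersect(annotated, universe)
  selected  <- intersect(selected, universe)
  c(n.GU   = length(universe),                       # size of gene universe
    n.GA   = length(annotated),                      # annotated at node X
    n.GI   = length(selected),                       # interesting genes
    n.GI.X = length(intersect(annotated, selected))) # interesting genes at node X
}

full     <- counts(universe, annotated, selected)
filtered <- counts(universe[iqr > 0.5], annotated, selected)  # arbitrary cutoff
rbind(full, filtered)
```

Because the annotated and selected sets are intersected with the universe, none of the four counts can grow when the universe shrinks, which is the coupling the question describes. Whether the resulting p-values move much depends on how the counts shrink relative to one another.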

GO rat2302 probe GOstats topGO • 2.4k views

ADD COMMENT • link updated 18.1 years ago by Seth Falcon ★ 7.4k • written 18.2 years ago by Jesper Ryge ▴ 110


Seth Falcon ★ 7.4k

@seth-falcon-992


Hi Jesper,

Jesper Ryge <jesper.ryge at ki.se> writes:
> I'm trying to perform an enrichment analysis for GO terms on my
> microarray results. My problem arose when I noticed that
> geneCounts(x) doesn't match the number of genes annotated at certain
> nodes according to geneIdsByCategory(x). Maybe that's OK; I just
> wondered whether it really is, or whether I missed something. I
> thought geneCounts was the number of interesting genes (from the list
> fed to geneIds) that belong to a particular GO term, and that
> geneIdsByCategory should list those genes, i.e. the numbers should
> match. This turned out not to be the case for at least two of the GO
> nodes in the list of significantly over-represented GO terms:
>
> > length(geneIdsByCategory(test)[["GO:0051179"]])
> [1] 89
> > geneCounts(test)["GO:0051179"]
> GO:0051179
>         20
> > length(geneIdsByCategory(test)[["GO:0007409"]])
> [1] 13
> > geneCounts(test)["GO:0007409"]
> GO:0007409
>          6
>
> test is the output from hyperGTest(params), a conditional test for
> over-representation on the rat2302 chip.
>
> As I said, I might have missed something, but it puzzles me somewhat.
> Comments welcome :-)

This doesn't look right to me either. Can you please send your sessionInfo() so I'm certain what versions of things you are using? I suspect there is a bug in how these functions handle the conditional case.

> As a "bonus" question, I was wondering if there is any consensus
> regarding filtering the gene universe before doing the GO enrichment
> analysis. I know it's recommended in the GOstats manual, for instance
> by removing probe sets with little variation across samples using IQR
> (or some similar measure). But in the topGO package by Adrian Alexa
> they seem to care little about this issue and use all GO-annotated
> probe sets from the chip used in the particular study.

Perhaps that answers your question: there is not widespread consensus.

> I was wondering: if you reduce the set of genes in the gene universe
> (n.GU), don't you also reduce the number of genes annotated to each
> GO term (n.GA) and most likely the number of interesting genes (n.GI)?

I think of the filtering process as part of the definition of "interesting gene". So a gene that doesn't pass the non-specific filtering is by definition not interesting and doesn't make it into the selected gene list.

Yes, non-specific filtering will reduce the set of genes annotated at some GO terms, but this is desired IMO.

> At least in my case, some of the genes that are filtered out by IQR
> were classified as significantly differentially expressed by cyberT
> or limma on the full data set. So what I'm asking here is: don't
> n.GI and n.GA depend on, and change as a function of, n.GU? At least
> when you use coarse-grained filtering methods it seems that this is
> the case, and you might lose some interesting genes and in effect
> throw out the baby with the bathwater, so to speak.
>
> Put in (yet) another way: the chance at GO node X of getting
> n.GI[X] interesting genes out of all the annotated genes n.GA[X] at
> that node, when sampling n.GI genes from n.GU at random, tells you
> something about the chance of enrichment at node X. I hope I got that
> part right? But if n.GI and n.GA depend on n.GU, this chance of
> enrichment might not change drastically when you reduce the gene
> universe with some coarse-grained variance method? Or?

I think you are on the right track. Filtering should change the results; otherwise, why would you filter? The question at hand is whether it is appropriate to include all genes annotated at a given GO term when testing that term. There is consensus (I hope) that genes that were not tested in the experiment should be removed. Non-specific filtering gives you a chance to remove additional genes that were tested, but appear to provide no information about the samples.

My experience is that you get more conservative results by reducing the gene universe as much as possible. If you play with phyper a bit, I suspect you will come to a similar conclusion.

+ seth

--
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org
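One way to "play with phyper" as suggested above is the following base R sketch. The numbers are invented, and the helper name `over_rep_p()` is an assumption made for this example rather than a GOstats function. It holds the term's annotated count, the selected-list size and the overlap fixed while the universe shrinks, which is a simplification: in a real filtering step those counts would usually shrink too.

```r
# Hypothetical helper: one-sided over-representation p-value, P(X >= k), where
#   N = genes in the universe, m = genes annotated at the term,
#   n = genes in the selected list, k = selected genes annotated at the term
over_rep_p <- function(k, m, n, N) {
  phyper(k - 1, m, N - m, n, lower.tail = FALSE)
}

# Same term and same selected list, evaluated against smaller and smaller universes
N <- c(10000, 5000, 2000)
p <- sapply(N, function(N) over_rep_p(k = 10, m = 100, n = 200, N = N))
data.frame(universe = N, p.value = signif(p, 3))
```

With everything else held fixed, the expected overlap n * m / N grows as N shrinks (2, 4 and 10 here), so the same observed overlap of 10 looks less and less surprising and the p-value increases. That is one sense in which a smaller universe gives more conservative results.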

ADD COMMENT • link 18.2 years ago Seth Falcon ★ 7.4k


Thanks for the fast answer :-) It's nice to know I'm battling my way in the right direction...

Here is the session info you requested (I'm using a Mac PowerPC G4 with Mac OS 10.4.9, if that's of any help):

    > sessionInfo()
    R version 2.5.0 (2007-04-23)
    powerpc-apple-darwin8.9.1

    locale:
    C

    attached base packages:
    [1] "splines"   "tools"     "stats"     "graphics"  "grDevices" "utils"
    [7] "datasets"  "methods"   "base"

    other attached packages:
           topGO      SparseM      GOstats     Category       Matrix         KEGG
         "1.2.0"       "0.72"      "2.2.0"      "2.2.2"  "0.9975-11"     "1.16.0"
            RBGL           GO         affy       affyio      rat2302    Rgraphviz
        "1.12.0"     "1.16.0"     "1.14.0"      "1.4.0"     "1.16.0"     "1.14.0"
     geneplotter      lattice        graph       xtable RColorBrewer   genefilter
        "1.14.0"     "0.15-4"     "1.14.0"      "1.4-3"      "0.2-3"     "1.14.1"
        survival     annotate      Biobase
          "2.31"     "1.14.1"     "1.14.0"

On 10 May 2007, at 16:53, Seth Falcon wrote:
> This doesn't look right to me either. Can you please send your
> sessionInfo() so I'm certain what versions of things you are using? I
> suspect there is a bug in how these functions handle the conditional
> case.

ADD REPLY • link 18.2 years ago Jesper Ryge ▴ 110


Seth Falcon ★ 7.4k

@seth-falcon-992


Jesper Ryge <jesper.ryge at ki.se> writes:
> thanks for the fast answer:-) its nice to know im battling my way in
> the right direction...

I believe I have found and fixed the bug causing the discrepancy in counts for conditional hyperGTests. The problem was that one of the functions was consulting the gene universe, not the _conditional_ gene universe.

The new versions for the release are:

    Category 2.2.3
    GOstats  2.2.2

They should be available in the repository by Monday.

+ seth

--
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org
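A mismatch of the kind described above can be reproduced with plain set operations. This is only a toy sketch and not the GOstats internals: the gene names are invented, and it assumes a conditional scheme in which genes annotated at an already significant child term are set aside before the parent term is counted.

```r
# Hypothetical annotations: the parent term inherits all of the child's genes
child    <- paste0("g", 1:6)
parent   <- paste0("g", 1:10)
selected <- paste0("g", c(1, 2, 3, 4, 8, 9))   # the "interesting" gene list

# Counting against the full gene universe: every gene at the parent counts
unconditional <- intersect(parent, selected)

# Counting against the conditional universe (assumed rule for this sketch):
# genes from the significant child are removed from the parent first
conditional <- intersect(setdiff(parent, child), selected)

length(unconditional)  # 6
length(conditional)    # 2
```

If one accessor reports the first kind of count and another reports the second, the two numbers disagree for any term that has significant children, which matches the pattern of one count being much larger than the other.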

ADD COMMENT • link 18.2 years ago Seth Falcon ★ 7.4k


It works :-)

One more question regarding GOstats. In your description of the GOstats package you mention that the conditional test is similar to that presented in Alexa et al. 2006. Would that be like the elim or weight function they describe? I tried comparing GOstats and topGO (Alexa's GO analysis package) and they produce similar outputs, though not identical. I wonder if the differences are due to the fact that I feed Entrez IDs into the GOstats package and Affy IDs into the topGO package, so they are not based on entirely the same set of gene IDs? Or do the statistical methods of the two differ? It's not so clear to me from the GOstats description exactly what you did in this conditional test. I could have missed something, so if it's described somewhere in more detail, a pointer to that would be just dandy :-)

Then lastly, these systems biology analysis tools for microarray data seem very helpful, like the GO enrichment analysis of GOstats and topGO. But I realise that a lot of genes are not annotated with GO terms, and I wonder how much I'm actually missing because of this incomplete annotation of genes. It becomes even "worse" for KEGG, where fewer genes are annotated and few significant KEGG pathways come out of the GOstats analysis. What is your experience with these kinds of analysis? How far can you push conclusions based on them?

I have also seen private companies offering curated protein-protein interaction databases to conduct similar analyses. Does that bring something new to the picture? I mean, that type of network describes a different way of linking genes into nodes and edges, perhaps more similar to KEGG than GO. But do they include more genes than, for example, KEGG, and are they worth the investment needed to get access? There is also analysis based on promoter analysis (e.g. Cartharius et al., 2005, Bioinformatics) in the search for common promoters and hence common transcription factor regulation, which creates yet another network of transcriptional regulation. These both seem like interesting analysis methods, but are there any implementations of such tools for R and Bioconductor, with access to protein interaction databases or promoter sequence/location databases?

I'm not too familiar with these tools, but I'm trying to figure out where to focus my efforts to get maximum information out of my microarray data. I like the network approach and the "holistic" perspective on gene expression and regulation, but unfortunately I'm not too knowledgeable about the available tools for this kind of analysis, nor about the possible pitfalls these types of analysis might be "hiding" and that one should be aware of. Any hints, links, pointers, comments or sharing of experience would be most welcome :-)

Cheers,
Jesper Ryge
PhD Student, Department of Neuroscience
Karolinska Institutet

On 11 May 2007, at 18:53, Seth Falcon wrote:
> I believe I have found and fixed the bug causing the discrepancy in
> counts for conditional hyperGTests. The problem was that one of the
> functions was consulting the gene universe, not the _conditional_ gene
> universe.
>
> The new versions for the release are:
>
> Category 2.2.3
> GOstats 2.2.2
>
> They should be available in the repository by Monday.

ADD REPLY • link 18.1 years ago Jesper Ryge ▴ 110


Seth Falcon ★ 7.4k

@seth-falcon-992


Jesper Ryge <jesper.ryge at ki.se> writes:
> It works :-)

Glad to hear it.

> One more question regarding GOstats. In your description of the
> GOstats package you mention that the conditional test is similar to
> that presented in Alexa et al. 2006. Would that be like the elim or
> weight function they describe? I tried comparing GOstats and topGO
> (Alexa's GO analysis package) and they produce similar outputs,
> though not identical. I wonder if the differences are due to the fact
> that I feed Entrez IDs into the GOstats package and Affy IDs into the
> topGO package, so they are not based on entirely the same set of gene
> IDs? Or do the statistical methods of the two differ? It's not so
> clear to me from the GOstats description exactly what you did in this
> conditional test. I could have missed something, so if it's described
> somewhere in more detail, a pointer to that would be just dandy :-)

The methods are similar, but were developed independently. So I would hope that the results are similar. I would be rather surprised if they were identical.

Did you find our article in Bioinformatics? It has a description of the conditional computation done in GOstats. The reference is:

S Falcon and R Gentleman. Using GOstats to test gene lists for GO term association. Bioinformatics, 23(2):257-8, 2007.

If that isn't enough, I can try to give further details...

> Then lastly, these systems biology analysis tools for microarray data
> seem very helpful, like the GO enrichment analysis of GOstats and
> topGO. But I realise that a lot of genes are not annotated with GO
> terms, and I wonder how much I'm actually missing because of this
> incomplete annotation of genes. It becomes even "worse" for KEGG,
> where fewer genes are annotated and few significant KEGG pathways
> come out of the GOstats analysis. What is your experience with these
> kinds of analysis? How far can you push conclusions based on them?

I think it is important to remember that annotation sources like GO and KEGG are not complete. So I would suggest not pushing such analysis too far ;-)

[sorry, perhaps someone else will have a better answer for you]

> I have also seen private companies offering curated protein-protein
> interaction databases to conduct similar analyses. Does that bring
> something new to the picture? I mean, that type of network describes
> a different way of linking genes into nodes and edges, perhaps more
> similar to KEGG than GO. But do they include more genes than, for
> example, KEGG, and are they worth the investment needed to get
> access? There is also analysis based on promoter analysis (e.g.
> Cartharius et al., 2005, Bioinformatics) in the search for common
> promoters and hence common transcription factor regulation, which
> creates yet another network of transcriptional regulation. These both
> seem like interesting analysis methods, but are there any
> implementations of such tools for R and Bioconductor, with access to
> protein interaction databases or promoter sequence/location
> databases?
>
> I'm not too familiar with these tools, but I'm trying to figure out
> where to focus my efforts to get maximum information out of my
> microarray data. I like the network approach and the "holistic"
> perspective on gene expression and regulation, but unfortunately I'm
> not too knowledgeable about the available tools for this kind of
> analysis, nor about the possible pitfalls these types of analysis
> might be "hiding" and that one should be aware of. Any hints, links,
> pointers, comments or sharing of experience would be most welcome :-)

I don't have any experience with the proprietary databases. There has been some work on protein interaction data; cf. ppiStats, ScISI.

+ seth

--
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org

ADD COMMENT • link 18.1 years ago Seth Falcon ★ 7.4k
