Question: GOstats, geneCounts and gene universe filtering...
12.4 years ago by
Jesper Ryge • 110
Hi,

I'm trying to perform an enrichment analysis for GO terms on my microarray results. My problem arose when I noticed that geneCounts(x) doesn't match the number of genes annotated at certain nodes according to geneIdsByCategory(x). Maybe that's fine, but I wondered whether it really is or whether I have missed something. I thought geneCounts was the number of interesting genes (from the list fed to geneIds) that belong to a particular GO term, and that geneIdsByCategory should list those genes, i.e. the numbers should match. This turned out not to be the case for at least two of the GO nodes in the list of significantly over-represented GO terms:

    > length(geneIdsByCategory(test)[["GO:0051179"]])
    [1] 89
    > geneCounts(test)["GO:0051179"]
    GO:0051179
            20
    > length(geneIdsByCategory(test)[["GO:0007409"]])
    [1] 13
    > geneCounts(test)["GO:0007409"]
    GO:0007409
             6

Here test is the output from hyperGTest(params), a conditional test for over-representation on the rat2302 chip. As I said, I might have missed something, but it puzzles me somewhat. Comments welcome :-)

As a "bonus" question, I was wondering whether there is any consensus on filtering the gene universe before doing the GO enrichment analysis. I know it is recommended in the GOstats manual, for instance by removing probe sets with little variation across samples using the IQR (or some similar measure). But in the topGO package by Adrian Alexa they seem to care little about this issue and use all GO-annotated probe sets from the chip used in the particular study.

If you reduce the set of genes in the gene universe (n.GU), don't you also reduce the number of genes annotated to each GO term (n.GA) and, most likely, the number of interesting genes (n.GI)? At least in my case, some of the genes filtered out by IQR were classified as significantly differentially expressed by cyberT or limma on the full data set. So what I'm asking is: don't n.GI and n.GA depend on, and change as a function of, n.GU? With coarse-grained filtering methods this seems to be the case, and you might lose some interesting genes and in effect throw the baby out with the bathwater, so to speak.

Put (yet) another way: the probability of getting n.GI[X] interesting genes out of all the n.GA[X] genes annotated at GO node X, when sampling n.GI genes from n.GU at random, tells you something about the chance of enrichment at node X. I hope I got that part right. But if n.GI and n.GA depend on n.GU, might this chance of enrichment not change drastically when you reduce the gene universe with some coarse-grained variance filter?

My preliminary test of filtering versus no filtering seems to show a rather small effect: most of the GO terms are identical in both cases. Does that mean I should place more trust in the terms that come up in both lists, from the filtered and the unfiltered gene universe? Or should I prefer one list over the other for some particular reason? It seems to me that the GO terms that are most robust to changes in the gene universe are the most likely candidates.

Hm, I realise this became a little long. I hope I explained it in a way that makes sense. Sorry if this has already been discussed, but I couldn't find any previous discussions on it. Advice and pointers are most appreciated :-)

Cheers,
Jesper Ryge
PhD Student, Department of Neuroscience
Karolinska Institutet
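To make the n.GU question concrete, here is a minimal sketch in Python (not GOstats code) of the hypergeometric tail probability described above: the chance of seeing at least a given number of interesting genes at a node when n.GI genes are drawn at random from a universe of n.GU genes, n.GA of which are annotated to that node. The function name `enrichment_p_value` and all the counts are hypothetical and only illustrate how the three quantities move together when the universe is filtered; they are not taken from my data.

```python
from math import comb


def enrichment_p_value(n_gu, n_ga, n_gi, n_hits):
    """Upper-tail hypergeometric probability P(X >= n_hits).

    n_gu   : size of the gene universe
    n_ga   : genes in the universe annotated to the GO node
    n_gi   : interesting (selected) genes drawn from the universe
    n_hits : interesting genes observed at the GO node
    """
    if not (0 <= n_ga <= n_gu and 0 <= n_gi <= n_gu):
        raise ValueError("n_ga and n_gi must lie between 0 and n_gu")
    max_hits = min(n_ga, n_gi)           # cannot observe more hits than this
    total = comb(n_gu, n_gi)             # all ways to pick the interesting genes
    tail = sum(
        comb(n_ga, k) * comb(n_gu - n_ga, n_gi - k)
        for k in range(max(n_hits, 0), max_hits + 1)
    )
    return tail / total


# Hypothetical counts for one GO node, before and after a variance filter.
# Filtering shrinks the universe, the annotated set and the interesting set.
p_unfiltered = enrichment_p_value(n_gu=10000, n_ga=200, n_gi=500, n_hits=20)
p_filtered = enrichment_p_value(n_gu=5000, n_ga=120, n_gi=450, n_hits=18)

print(f"unfiltered universe: p = {p_unfiltered:.3g}")
print(f"filtered universe:   p = {p_filtered:.3g}")
```

Since all four inputs change together when the universe is filtered, the p-value for a node can go up or down; the sketch only shows the dependence and says nothing about which universe is the right one.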