It depends on what you mean by GSEA. That term is generally assumed to mean the specific methods developed at the Broad Institute. There are other methods, such as using Fisher's exact test (based on the hypergeometric distribution) and other gene set tests, which include the so-called competitive and self-contained gene set tests.
It also depends on what sort of data you have. You say microarray data, but then you list goseq, which is intended for RNA-Seq data.
There is really no such thing as 'best' for any statistical method. Instead there are the assumptions you are willing to make, the hypothesis you are trying to test, and how well you understand a method and will be able to explain it to others. That last one often trumps the first two.
The 'conventional' test for a KEGG pathway is probably Fisher's exact test, where you decide (based on some cutoff) which genes are significant, then further apportion them into the genes that are in the set and those that are not. You then test whether there are more significant genes in the pathway than you would expect by chance. I am not sure you can use topGO for that sort of test, but you can for sure use GOstats.
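To make the arithmetic concrete, here is a minimal sketch of that one-sided test written directly from the hypergeometric distribution. It is in plain Python only for illustration and is not what GOstats runs internally; the function name `enrichment_p_value` and all of the counts are made up for the example.

```python
from math import comb

def enrichment_p_value(n_genes, n_in_set, n_significant, n_sig_in_set):
    """One-sided Fisher's exact test for over-representation.

    Returns P(X >= n_sig_in_set), where X is the number of significant
    genes falling in the set under the hypergeometric distribution.
    """
    total = comb(n_genes, n_significant)
    upper = min(n_in_set, n_significant)
    tail = sum(
        comb(n_in_set, k) * comb(n_genes - n_in_set, n_significant - k)
        for k in range(n_sig_in_set, upper + 1)
    )
    return tail / total

# Hypothetical counts: 10000 genes on the array, 100 in the pathway,
# 500 called significant at the chosen cutoff, 15 of those in the pathway.
# By chance alone you would expect about 100 * 500 / 10000 = 5 in the pathway.
p = enrichment_p_value(10000, 100, 500, 15)
print(f"enrichment p-value = {p:.2e}")
```

Note that the answer depends entirely on the cutoff used to arrive at the 500 significant genes, which is the objection raised next.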
You can argue that using a cutoff is an arbitrary construct that ignores the genes right below the cutoff. If you want to make that argument, then you want to use either GSEA or a gene set test of some sort. I don't know if there is a 'pure' version of GSEA in BioC, but you can use e.g. the GSEABase and GSEAlm packages to do gene set testing based on linear models and t-statistics. Or you could use the limma package, which has multiple functions (romer, roast, camera, geneSetTest) that do various different gene set tests, with various underlying assumptions.
The biggest difference (IMO) between the Fisher's exact test and the various gene set tests is that the results from a Fisher's exact 'look more real' to the average biologist as compared to the results of a gene set test. In other words, for the Fisher's exact test you have a set of genes that you are saying are significantly differentially expressed, and then just seeing if a gene set has more significant genes than you would expect by chance. So you are starting out with genes that you (and more importantly, your collaborator or PI) consider truly differentially expressed.
For the other gene set tests you are instead either ranking the genes by a statistic and testing if the gene set is higher or lower in the ranking than you would expect by chance (competitive gene set test), or you are testing that at least one gene in the set is significant (self-contained gene set test). In both cases most, if not all, of the genes in the set may not individually be significantly differentially expressed. That can be a hard sell - it can be difficult to explain to non-statisticians how a set of genes, most of which aren't differentially expressed, can in aggregate be significant.
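A toy illustration of the competitive idea may help with that explanation. The sketch below compares the mean statistic of a gene set against the means of randomly drawn gene sets of the same size. It is a simplified gene-sampling test written for this post, not the implementation behind any limma function, and the function name `competitive_p_value` and the data are invented.

```python
import random

def competitive_p_value(stats, set_indices, n_perm=2000, seed=0):
    """Gene-sampling p-value: is the set's mean statistic unusually high?"""
    rng = random.Random(seed)
    k = len(set_indices)
    observed = sum(stats[i] for i in set_indices) / k
    hits = 0
    for _ in range(n_perm):
        # Draw a random "gene set" of the same size from all genes.
        random_mean = sum(rng.sample(stats, k)) / k
        if random_mean >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Toy data: 980 background genes with t-like statistics spread evenly over
# [-2, 2], plus a 20-gene set where every gene has a modest statistic of 1.0.
background = [-2 + 4 * i / 979 for i in range(980)]
stats = background + [1.0] * 20
gene_set = list(range(980, 1000))

p = competitive_p_value(stats, gene_set)
print(f"competitive p-value = {p:.4f}")
```

No gene in the toy set has a statistic that would pass a typical per-gene cutoff on its own, yet twenty genes all shifted in the same direction is something a random set of twenty genes rarely shows, and that is what the test picks up.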