Question

Best method/package for Gene Set Enrichment Analysis in microarrays?

jacorvar ▴ 40

Dear community,

I recently started working with GSEA of microarray data in Bioconductor, and after a quick search I am quite overwhelmed by the wide range of packages that compute GSEA for a given list of differentially expressed probe sets / genes (goseq, topGO, gega, gsea, GOstats...). In many cases every method claims to be the best, and it is getting hard for me to choose a suitable one.

Do you know of any benchmark, or which method is the best one? And what about the analysis of KEGG pathway enrichment?

GSEA goseq topgo gega gostats • 4.7k views

updated 8.7 years ago by Robert Castelo ★ 3.3k • written 8.7 years ago by jacorvar ▴ 40

score 7 · Accepted Answer · 2015-11-18

It depends on what you mean by GSEA. That term is generally assumed to mean the specific methods developed at the Broad Institute. There are other methods, such as using Fisher's exact test (based on the hypergeometric distribution) and other gene set tests, which include the so-called competitive and self-contained gene set tests.

It also depends on what sort of data you have. You say microarray data, but then you list goseq, which is intended for RNA-Seq data.

There is really no such thing as 'best' for any statistical method. Instead there are the assumptions you are willing to make and the hypothesis you are trying to test, plus how well you understand a method and will be able to explain it to others. That last one often trumps the first two.

The 'conventional' test for a KEGG pathway is probably Fisher's exact test, where you decide (based on some cutoff) which genes are significant, then split all genes into those that are in the set and those that are not. You then test whether there are more significant genes in the pathway than you would expect by chance. I am not sure you can use topGO for that sort of test, but you can for sure use GOstats.
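To make the over-representation idea concrete, here is a minimal sketch in plain Python (not the GOstats implementation). It computes the one-sided Fisher's exact p-value as the upper tail of the hypergeometric distribution. The function name and all of the counts are hypothetical and only illustrate the calculation.

```python
from math import comb

def enrichment_pvalue(n_universe, n_in_set, n_significant, n_overlap):
    """One-sided Fisher's exact p-value (hypergeometric upper tail).

    Probability of seeing at least `n_overlap` pathway genes among the
    `n_significant` genes when they are drawn at random from the universe.
    """
    total = comb(n_universe, n_significant)
    upper = min(n_in_set, n_significant)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_significant - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 10000 genes on the array, 100 in the pathway,
# 500 called significant, 15 of those in the pathway (5 expected by chance).
p_enrich = enrichment_pvalue(10000, 100, 500, 15)
print(p_enrich)
```

Note that the result depends on the cutoff used to call the 500 genes significant and on what you take as the gene universe, which is usually the set of genes actually measured on the array.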

You can argue that using a cutoff is an arbitrary construct that ignores the genes right below the cutoff. If you want to make that argument, then you want to use either GSEA or a gene set test of some sort. I don't know if there is a 'pure' version of GSEA in BioC, but you can use e.g. the GSEABase and GSEAlm packages to do gene set testing based on linear models and t-statistics. Or you could use the limma package, which has multiple methods (romer, roast, camera, geneSetTest) that perform different gene set tests, with various underlying assumptions.

The biggest difference (IMO) between the Fisher's exact test and the various gene set tests is that the results from a Fisher's exact 'look more real' to the average biologist as compared to the results of a gene set test. In other words, for the Fisher's exact test you have a set of genes that you are saying are significantly differentially expressed, and then just seeing if a gene set has more significant genes than you would expect by chance. So you are starting out with genes that you (and more importantly, your collaborator or PI) consider truly differentially expressed.

For the other gene set tests you are instead either ranking the genes by a statistic and testing if the gene set is higher or lower in the ranking than you would expect by chance (competitive gene set test), or you are testing that at least one gene in the set is significant (self-contained gene set test). In both cases most if not all of the genes may not actually be significantly differentially expressed. That can be a hard sell - it can be difficult to explain to non-statisticians how a set of genes, most of which aren't differentially expressed, can in aggregate be significant.
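A toy example may help with the hard sell. The sketch below is a simple gene-permutation competitive test in plain Python: it asks whether the mean statistic of the set is higher than that of random gene sets of the same size. It is an illustration only, not what camera, romer or geneSetTest actually compute (camera, for instance, also accounts for inter-gene correlation). The gene names, the statistics and the function name are all made up.

```python
import random

def competitive_pvalue(stats, gene_set, n_perm=2000, seed=0):
    """Gene-permutation p-value for 'the set ranks higher than random sets'.

    stats: dict mapping gene -> statistic (e.g. a moderated t).
    gene_set: collection of gene names; only genes present in stats are used.
    """
    rng = random.Random(seed)
    genes = list(stats)
    members = [g for g in genes if g in gene_set]
    observed = sum(stats[g] for g in members) / len(members)
    hits = 0
    for _ in range(n_perm):
        # Draw a random gene set of the same size from all measured genes.
        sample = rng.sample(genes, len(members))
        if sum(stats[g] for g in sample) / len(members) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Made-up data: 180 background genes with statistics around 0, plus a
# 20-gene pathway in which every gene has t = 1.5, i.e. none of them
# would pass a typical per-gene cutoff such as |t| > 2.
rng = random.Random(1)
stats = {f"bg{i}": rng.gauss(0.0, 1.0) for i in range(180)}
pathway_genes = [f"set{i}" for i in range(20)]
stats.update({g: 1.5 for g in pathway_genes})
p_competitive = competitive_pvalue(stats, set(pathway_genes))
print(p_competitive)
```

No single pathway gene is individually convincing here, yet a consistent modest shift across all 20 is very unlikely for a random set of 20 genes, which is exactly the aggregate effect that is hard to explain to non-statisticians.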

score 5 · Accepted Answer · 2015-11-19

Hi,

In addition to everything that James said I would also recommend reading the following article:

Goeman, J. J., & Bühlmann, P. (2007). Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics, 23(8), 980-987.
http://bioinformatics.oxfordjournals.org/content/23/8/980.long

which makes a useful distinction between methodologies in terms of the null hypothesis being tested:

Competitive null: there are no differences between genes inside and outside the gene set. Examples of this kind of method are the Fisher's exact test or classical GSEA. In both cases, the calculation of the gene-set level statistic involves genes inside and genes outside the gene set. Packages for this kind of test are GOstats or limma using the camera or romer functions.
Self-contained null: no gene in the gene set is differentially expressed. Note that this is only defined in terms of the genes inside the gene set being tested. Packages for this kind of test are globaltest or limma using the roast function.

Under the self-contained null, a single differentially expressed gene is enough to make every gene set it belongs to differentially expressed as well. This may or may not be useful, depending on the context of your analysis. So, in general, self-contained tests tend to report more differentially expressed gene sets than competitive tests. Among the competitive tests, those based on the Fisher's exact test require a sizeable input list of differentially expressed genes (I'd say in general at least 50). If that is not your case, then you definitely have to use GSEA-like methods such as camera or romer, or self-contained tests.
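As a rough illustration of the self-contained null, the sketch below permutes the sample labels and uses only the genes inside the set. It is plain Python with a deliberately simple statistic (the sum of squared group-mean differences), not the roast or globaltest algorithm, and the expression values and function name are invented for the example.

```python
import random

def self_contained_pvalue(expr, labels, n_perm=2000, seed=0):
    """Sample-permutation p-value for 'no gene in the set is DE'.

    expr: list of rows, one per gene in the set, each a list of sample values.
    labels: list of 0/1 group labels, one per sample.
    """
    def statistic(lab):
        total = 0.0
        for row in expr:
            a = [x for x, g in zip(row, lab) if g == 1]
            b = [x for x, g in zip(row, lab) if g == 0]
            total += (sum(a) / len(a) - sum(b) / len(b)) ** 2
        return total

    rng = random.Random(seed)
    observed = statistic(labels)
    perm = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)  # reassign samples to groups at random
        if statistic(perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Invented data: 6 controls and 6 cases; one clearly DE gene plus four
# genes that do not differ between the groups.
labels = [0] * 6 + [1] * 6
de_gene = [5.0, 5.1, 4.9, 5.2, 4.8, 5.0, 8.0, 8.1, 7.9, 8.2, 7.8, 8.0]
flat_row = [6.0, 6.1, 5.9, 6.0, 6.1, 5.9, 6.0, 5.9, 6.1, 6.0, 5.9, 6.1]
p_self = self_contained_pvalue([de_gene] + [flat_row] * 4, labels)
print(p_self)
```

The single DE gene drives the whole set to a small p-value even though the other four genes do nothing, which is the behaviour described above.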

Finally, a different avenue is to change your unit of analysis from genes to gene sets by transforming your input gene-by-sample expression data matrix into a gene-set-by-sample expression data matrix. The package GSVA offers a few different ways to do that and harnesses the BioC infrastructure to do the magic of matching gene-expression features (probe or gene identifiers) to gene set definitions. After this transformation you can analyze gene sets as if they were genes.
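To show what such a transformation looks like, here is a deliberately simplified stand-in in plain Python: each gene is standardized across samples and the z-scores of the genes in a set are averaged within each sample. This is not the GSVA algorithm and does not use its API. The function name, gene names and values are hypothetical.

```python
from statistics import mean, pstdev

def gene_set_scores(expr, gene_sets):
    """Turn a gene-by-sample matrix into a gene-set-by-sample matrix.

    expr: dict mapping gene -> list of per-sample expression values.
    gene_sets: dict mapping set name -> collection of gene names.
    Returns a dict mapping set name -> list of per-sample scores.
    """
    z = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        # Genes with no variation carry no information; score them as 0.
        z[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    scores = {}
    for name, members in gene_sets.items():
        present = [g for g in members if g in z]  # drop unmeasured genes
        if not present:
            continue
        n_samples = len(z[present[0]])
        scores[name] = [mean(z[g][j] for g in present) for j in range(n_samples)]
    return scores

# Hypothetical data: three genes measured in four samples.
expr = {
    "geneA": [1.0, 1.0, 3.0, 3.0],
    "geneB": [2.0, 2.0, 6.0, 6.0],
    "geneC": [5.0, 5.0, 5.0, 5.0],
}
gene_sets = {"up_set": ["geneA", "geneB", "not_on_array"], "flat_set": ["geneC"]}
set_scores = gene_set_scores(expr, gene_sets)
print(set_scores)
```

The resulting per-sample scores can then go through the same kind of two-group comparison you would otherwise run gene by gene.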

cheers,

robert.