Question

Statistic for GSA

Lluís Revilla Sancho ▴ 730

@lluis-revilla-sancho

European Union

Several packages implement functions for Gene Set (Enrichment) Analysis, among them limma, GSAR, GSVA, piano, fgsea and PGSEA. However, I couldn't find any agreement on which statistic to use for the tests that accept an arbitrary statistic as input: fgsea, runGSA, geneSetTest.

Using plotting functions such as plotEnrichment and barcodeplot, I found that the results change depending on which statistic I use. If I use limma's topTable result as input, I have the logFC, AveExpr, t, P.Value, adj.P.Val and B statistics. Are there any recommendations or articles about which one is better?

Of course, doing the analysis with P.Value and adj.P.Val gives similar results, but there are noticeable differences when I use logFC or t values.

gsea concept gsa • 2.0k views

Question • written 7.3 years ago by Lluís Revilla Sancho ▴ 730 • updated 7.3 years ago by Pekka Kohonen ▴ 190

score 0 · Answer 1 · 2017-02-03

Pekka Kohonen ▴ 190

@pekka-kohonen-5862

Sweden

It is better to use dedicated gene set testing functions than to construct the test yourself (e.g., by passing FC or the t-statistic to geneSetTest). This is because the "real" gene set tests take into account other considerations that affect the result, such as how the gene set p-value is calculated (sample permutation or rotation is generally the best, or most conservative, approach). Tests that implement all of these details properly include camera, romer, roast (mroast) and fry in the limma package. The geneSetTest function should not be used unless you have to (because you lack biological replicates or replicated data in general). The GSA package is also OK (it uses sample permutation). With other packages, prefer the functions that use sample permutation.

But to answer your question, I would say that a moderated t-statistic from limma, edgeR or DESeq2 would be the best option. It takes into account both the magnitude of the change and its uncertainty (variance). The ordering should be similar to that given by the p-value, but multiple-testing adjustment might result in a different order (some adjusted p-values become 1 and are then tied). Also, p-values may be one-sided (by default), whereas t-statistics can be used one-sided or two-sided (if so desired). The t-statistic is also what most of the packages mentioned above use.
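The ordering point can be illustrated with a toy sketch. It assumes a standard normal approximation for the two-sided p-value (not limma's moderated t-distribution), and the gene names and values are hypothetical. Ranking by p-value reproduces the ranking by |t|, while ranking by the signed t-statistic keeps up- and down-regulated genes at opposite ends of the list:

```python
import math

def two_sided_p(t):
    """Two-sided p-value under a standard normal approximation (an assumption)."""
    return math.erfc(abs(t) / math.sqrt(2))

# Hypothetical test statistics for six genes.
t_stats = {"geneA": 4.2, "geneB": -3.9, "geneC": 2.1,
           "geneD": -0.3, "geneE": 0.8, "geneF": -2.5}
p_values = {gene: two_sided_p(t) for gene, t in t_stats.items()}

# Signed t: most up-regulated first, most down-regulated last.
rank_by_t = sorted(t_stats, key=t_stats.get, reverse=True)
# p-value: most significant first, direction is lost.
rank_by_p = sorted(p_values, key=p_values.get)
# |t|: same ordering as the p-value ranking.
rank_by_abs_t = sorted(t_stats, key=lambda g: abs(t_stats[g]), reverse=True)

print(rank_by_t)
print(rank_by_p)
```

In the p-value ranking, geneA (up) and geneB (down) sit next to each other at the top, which is one reason enrichment plots drawn from p-values and from t-statistics can look quite different.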

Answer • 7.3 years ago • Pekka Kohonen ▴ 190

Thanks for your answer. I am aware of the differences and improvements of camera, romer, roast and fry compared to similar methods. But researchers are used to the plots from the Broad Institute when they ask for a GSEA, so I would like to show the gene sets found with those functions in a way that is also appealing when compared with those plots.

Reply • 7.3 years ago • Lluís Revilla Sancho ▴ 730

Yes, I suppose that if you use those tools for the analysis, then the moderated t-statistic would be best. The barplot should use whatever statistic is used in the GSEA method, I suppose, and those methods generally use the t-statistic (at least the limma ones do).

The fgsea package seems to have a really nice way of visualizing the results, integrating barplots into a results table. It is also a fast implementation of at least some of the Broad Institute methods. I am not quite sure whether it does sample permutations (or sample+gene permutations), but if you were considering using geneSetTest, then this would at least be an improvement over that.

Reply • 7.3 years ago • Pekka Kohonen ▴ 190

fgsea works downstream of gene ranking, so it's impossible to do sample permutations and only gene permutations are used for the test.
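To make "gene permutation on a pre-ranked list" concrete, here is a minimal sketch. It is not fgsea's implementation: it assumes an unweighted, Kolmogorov-Smirnov-like running sum, and the function names, gene names and statistics are all hypothetical. Only the ranked gene-level statistics are needed, which is why sample labels never enter the test:

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Unweighted KS-like running sum; returns the extreme deviation from zero."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def gene_permutation_p(stats, gene_set, n_perm=2000, seed=1):
    """Compare the observed score with scores of random gene sets of equal size."""
    rng = random.Random(seed)
    ranked = sorted(stats, key=stats.get, reverse=True)
    observed = enrichment_score(ranked, gene_set)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked, len(gene_set)))
        if abs(enrichment_score(ranked, random_set)) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)

# Hypothetical gene-level statistics (e.g. t-statistics) for 20 genes.
stats = {f"g{i}": 10 - i for i in range(20)}
gene_set = {"g0", "g1", "g2", "g3"}  # the four top-ranked genes
observed, p_value = gene_permutation_p(stats, gene_set)
print(observed, p_value)
```

Because random gene sets ignore the correlation between genes within a real set, this kind of null is generally less conservative than sample permutation or rotation.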

Reply • 7.3 years ago • alserg ▴ 260

Thanks! I will probably use it for ssGSEA analysis at least! Is it able to do mixed or absolute gene set statistic analyses? There are some recent papers (Nam D. 2015 and Yoon S. et al. 2016) which show (as far as I can tell) that using the absolute statistic reduces the inter-gene correlation problem when doing ssGSEA analyses.
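As a small illustration of what a directional versus an absolute gene set statistic means, the sketch below averages hypothetical gene-level statistics with and without taking absolute values. It only shows the difference for a set with changes in both directions; it does not reproduce the correlation results of the cited papers, and the function name is made up for this example:

```python
from statistics import mean

def set_score(stats, gene_set, absolute=False):
    """Mean gene-level statistic of a set, optionally on absolute values."""
    values = [abs(stats[g]) if absolute else stats[g] for g in gene_set]
    return mean(values)

# Hypothetical gene-level statistics.
stats = {"g1": 3.0, "g2": -3.0, "g3": 2.5, "g4": -2.5, "g5": 0.2, "g6": -0.1}
mixed_set = ["g1", "g2", "g3", "g4"]  # half up, half down

signed_score = set_score(stats, mixed_set)
absolute_score = set_score(stats, mixed_set, absolute=True)
print(signed_score, absolute_score)
```

The signed score cancels out for this set, whereas the absolute score flags it as strongly changed.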

References:

Nam D. Effect of the absolute statistic on gene-sampling gene-set analysis methods. Stat Methods Med Res. 2015 Mar 2. pii: 0962280215574014.

Yoon S, Kim SY, Nam D. Improving Gene-Set Enrichment Analysis of RNA-Seq Data with Small Replicates. PLoS One. 2016 Nov 9;11(11):e0165919.

Reply • 7.3 years ago • Pekka Kohonen ▴ 190

Yes, but none of the limma functions that perform a GSEA report the statistic they use, and other methods don't report it either, hence my question.

Reply • 7.3 years ago • Lluís Revilla Sancho ▴ 730

Hi, as far as I can tell they all (i.e. roast, romer, camera, fry) use the moderated t-statistic in one way or another. This is also consistent with the limma philosophy (Ritchie et al. 2015), which stresses the importance of the eBayes information borrowing, the variance trend modelling and so on. To be sure, one would have to look at the code itself. To be exact, you would also have to apply the same parameters to the visualization as to the analysis (i.e., if you used trend=TRUE in camera or roast/romer, use the same setting for the visualization).

Maybe use the fgsea package (which is an implementation of the Broad tool) alongside whichever method from limma you prefer (camera and romer are both competitive methods, like pre-ranked GSEA), and only show them the results that are significant in both.
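Keeping only the gene sets significant in both analyses amounts to a set intersection. A minimal sketch, assuming each method's adjusted p-values have been exported as a name-to-value mapping (the set names, values and the 0.05 cutoff are hypothetical):

```python
def significant_in_both(results_a, results_b, alpha=0.05):
    """Return gene sets whose adjusted p-value is below alpha in both results."""
    sig_a = {name for name, p in results_a.items() if p < alpha}
    sig_b = {name for name, p in results_b.items() if p < alpha}
    return sorted(sig_a & sig_b)

# Hypothetical adjusted p-values from two methods.
fgsea_padj = {"SET_A": 0.01, "SET_B": 0.20, "SET_C": 0.03}
camera_fdr = {"SET_A": 0.04, "SET_B": 0.01, "SET_C": 0.30}

shared = significant_in_both(fgsea_padj, camera_fdr)
print(shared)
```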

Reply • 7.3 years ago, updated 7.2 years ago • Pekka Kohonen ▴ 190