Question

GSEA for RNA-seq analysis

1

imalumberjack ▴ 10

@imalumberjack-15042

Hello everyone,

I was wondering if anyone could offer some clarity on the appropriate GSEA settings to use with RNA-seq data?

In brief, I have two groups (n = 17 in group 1 and n = 13 in group 2), and I am interested in testing for the enrichment of a signature.

My data have been filtered on a mean absolute deviation (MAD) cutoff to exclude genes with low variance, and I've used limma (specifically voomWithQualityWeights) to fit a linear model to the data and generate differentially expressed gene lists.

Additionally, I exported the entire dataset as a .gct file, together with a .cls phenotype file, and analysed it in GSEA with the Signal2Noise ranking metric. However, I have read that GSEApreranked might be better. Is this a more valid approach? I have also read in a few places that it might inflate my p-values and should only be used under certain circumstances (e.g. low numbers of replicates, https://stat.ethz.ch/pipermail/bioconductor/2014-January/057214.html).

If pre-ranked is the way to go, there appears to be little consensus on the best way to rank my genes (by p-value or by FC?), and I'd very much appreciate some guidance on that as well.

Kind regards and many thanks, in advance, for your help!

GSEA Limma preranked GSEA tTest • 4.6k views

ADD COMMENT • link updated 7.5 years ago by Gordon Smyth 53k • written 7.5 years ago by imalumberjack ▴ 10

0

I recommend either ROAST (from the limma package) or QuSAGE for gene set analysis.

ADD REPLY • link 7.5 years ago chris86 ▴ 420

score 0 · Answer 1 · 2018-08-12

0

Gordon Smyth 53k

@gordon-smyth

WEHI, Melbourne, Australia

GSEA isn't a Bioconductor program. If you have questions about how to use it, then you should send them to the GSEA authors or to a GSEA forum. I will make a few comments though:

1. I've never heard anyone claim that GSEApreranked is better than the standard GSEA methodology, so I don't know where you would have read that.
2. I don't see how you could export the dataset from voom and input it into GSEA, because voom produces precision weights and GSEA can't use precision weights.
3. It isn't correct to filter by MAD and then do a limma analysis. You should never filter by variance before doing an empirical Bayes analysis.
4. If you wanted to try the Bioconductor gene set enrichment functionality, then this forum would be the right place.

ADD COMMENT • link 7.5 years ago Gordon Smyth 53k

0

1. Actually, some time ago I implemented a label-permuting GSEA test in the fgsea package. The difference was that adjusted p-values were calculated with the BH method, as opposed to the ad hoc NES-based method in Broad's version. It turned out to be too conservative: I couldn't find a dataset with any significant results after multiple hypothesis correction. Pre-ranked GSEA, on the other hand, does work, but its results need to be interpreted with caution.

ADD REPLY • link 7.5 years ago alserg ▴ 280

1

You seem to be comparing your own modified permutation method to your own modified pre-ranked method, so that doesn't seem very relevant to OP's question, which was about GSEA itself.

It's already known on theoretical grounds that BH can't work with permutation methods. Same goes for any p-value correction algorithm. It's also known that the pre-ranked method doesn't control the error rate, not even remotely, unless you adjust for inter-gene correlations as do camera or QuSAGE.
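To illustrate the point about inter-gene correlation, here is a minimal toy simulation in Python. It is not GSEA, camera, or QuSAGE: it assumes a simple mean-of-statistics set test, standard normal gene-level statistics under the null, and an equal pairwise correlation `rho` between genes in the set (all hypothetical choices for illustration). Under those assumptions the variance of the set mean is inflated by a factor of 1 + (m - 1) * rho relative to independence, so a test that ignores the correlation rejects far too often.

```python
import random
import statistics


def simulate_null_set_means(m, rho, n_sims, seed=0):
    """Simulate the mean gene-level statistic of a null gene set.

    Each gene statistic is N(0, 1); any two genes in the set have
    correlation rho, induced by a shared random component.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_sims):
        shared = rng.gauss(0, 1)
        stats = [
            rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1)
            for _ in range(m)
        ]
        means.append(statistics.fmean(stats))
    return means


def false_positive_rate(means, m, vif, z_crit=1.96):
    """Two-sided rejection rate when the set mean is scaled by sqrt(vif / m)."""
    se = (vif / m) ** 0.5
    return sum(abs(x) / se > z_crit for x in means) / len(means)


m, rho = 50, 0.05                      # hypothetical set size and correlation
means = simulate_null_set_means(m, rho, n_sims=4000)
vif = 1 + (m - 1) * rho                # variance inflation factor
naive_rate = false_positive_rate(means, m, vif=1.0)   # assumes independence
adjusted_rate = false_positive_rate(means, m, vif=vif)
print(f"VIF = {vif:.2f}")
print(f"naive false positive rate:    {naive_rate:.3f}")
print(f"adjusted false positive rate: {adjusted_rate:.3f}")
```

Even a small correlation of 0.05 gives a variance inflation factor of 3.45 for a set of 50 genes, so the nominal 5% test that assumes independence rejects much more often than 5%, while the adjusted test stays near the nominal level.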

ADD REPLY • link 7.5 years ago Gordon Smyth 53k

0

Dear Gordon,

Without wasting too much time, could you provide an easily comprehensible reference about "p-value correction algorithm can't work with permutation methods"? I am wondering if this applies to the SAM methodology.

Sorry for being out of the scope of the OP.

ADD REPLY • link 7.5 years ago SamGG ▴ 360

3

What I mean is that all the p-value adjustment methods require some of the p-values to be very small in order to survive multiple testing adjustment when the number of gene sets is large, and getting very small p-values requires a prohibitively large number of permutations.

For example, suppose you are testing the MSigDB C2 collection with about 5000 gene sets. You need the smallest p-values to be about 0.05/5000 = 10^-5 or smaller in order to get an FDR below 0.05, and this requires on the order of 10^5 permutations. To get a worthwhile number of DE sets, the number of permutations needs to be much larger again, which is prohibitively slow. Even then, all the gene sets with the smallest p-value will be equally ranked because permutation can't resolve small p-values. It's all quite unsatisfying.
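The arithmetic above can be checked with a short Python sketch. It assumes the common permutation p-value estimator (b + 1) / (B + 1), where b is the number of permuted statistics at least as extreme as the observed one and B is the number of permutations, so the smallest reportable p-value is 1 / (B + 1). The function names are hypothetical and not taken from GSEA, limma, or any other package.

```python
import math
from fractions import Fraction


def min_permutation_p(n_perm):
    """Smallest p-value reportable from n_perm permutations,
    using the (b + 1) / (n_perm + 1) estimator with b = 0."""
    return 1.0 / (n_perm + 1)


def permutations_needed(n_sets, fdr="0.05"):
    """Smallest n_perm for which 1 / (n_perm + 1) <= fdr / n_sets.

    Exact rational arithmetic avoids floating-point rounding at the boundary.
    """
    threshold = Fraction(fdr) / n_sets
    return math.ceil(1 / threshold) - 1


n_sets = 5000  # roughly the size of the collection in the example above
needed = permutations_needed(n_sets)
print(f"permutations needed for the top set: {needed}")
print(f"smallest p-value with 1000 permutations: {min_permutation_p(1000):.6f}")
```

With 1000 permutations the smallest possible p-value is about 0.001, a hundred times larger than the 10^-5 needed here, and every gene set that beats all permutations gets that same tied value.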

The same considerations would apply to SAM or to any permutation method, which is why SAM instead uses an FDR estimate based on the global permutation distribution of the test statistic. SAM is often applied to very small sample sizes, so there will only be a limited number of distinct permutations anyway.

The same sort of considerations also apply to my own mroast() rotation method, which is why we recommend fry() or camera() instead when dealing with large collections of gene sets.

ADD REPLY • link 7.5 years ago Gordon Smyth 53k

0

Gordon, could you please provide one or two links about p-value correction methods being incompatible with permutation tests? I'd love to read more about it.

ADD REPLY • link 7.4 years ago alserg ▴ 280