Question

gseGO input list

leakerb • 0

Hi, sorry if this is an overly simple question, but I couldn't find a clear answer on the forums or in the vignette. I'm running gene set enrichment analysis using the gseGO function in clusterProfiler. The function needs a ranked list of genes, which I'm planning to rank by log fold change. Should the gene list contain all genes, or only genes below a significance cutoff (e.g. padj < 0.05)? I know some people also rank using something like (signed fold change * -log10 p-value). Should that metric be computed for all genes or only for those below a significance cutoff? If both inputs (all genes and padj < 0.05) are valid, under what circumstances should you use one over the other?

clusterProfiler • 3.5k views

2.7 years ago • leakerb

score 3 · Accepted Answer · 2021-11-02

For GSEA, which is a functional class scoring (FCS) method, you should use all genes, not a subset. If you first select a subset with a significance cutoff, you are performing an over-representation analysis (ORA) instead. For more info on the differences between the two methods (FCS vs ORA), you may want to check the links in this post: Cluster profiler - KEGG analysis
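To make the "all genes" input concrete, here is a minimal sketch of how a ranked list for gseGO() could be built. The data frame `res`, its column names, and the helpers `make_gene_list()` and `run_gse_go()` are hypothetical and only for illustration; the sketch also assumes human data with Entrez IDs, so swap in the OrgDb and keyType that match your organism.

```r
# Hypothetical differential-expression results: one row per tested gene,
# with no significance filter applied.
res <- data.frame(
  entrez         = c("7157", "672", "1956", "5290", "207", "7157"),
  log2FoldChange = c(2.1, -1.4, 0.3, NA, -0.2, 1.0),
  padj           = c(0.001, 0.01, 0.6, NA, 0.8, 0.04),
  stringsAsFactors = FALSE
)

# Build the named, decreasingly sorted numeric vector that gseGO() expects.
make_gene_list <- function(df, id_col, score_col) {
  # Drop genes without an ID or without a score.
  df <- df[!is.na(df[[id_col]]) & !is.na(df[[score_col]]), ]
  # For duplicated IDs keep the row with the largest absolute score
  # (an assumption; other ways of collapsing duplicates are possible).
  df <- df[order(abs(df[[score_col]]), decreasing = TRUE), ]
  df <- df[!duplicated(df[[id_col]]), ]
  scores <- df[[score_col]]
  names(scores) <- df[[id_col]]
  sort(scores, decreasing = TRUE)
}

gene_list <- make_gene_list(res, "entrez", "log2FoldChange")

# Wrapper around gseGO(); not called here because the toy list above is
# far too short for a meaningful enrichment run.
run_gse_go <- function(gene_list) {
  stopifnot(requireNamespace("clusterProfiler", quietly = TRUE),
            requireNamespace("org.Hs.eg.db", quietly = TRUE))
  clusterProfiler::gseGO(
    geneList     = gene_list,
    OrgDb        = org.Hs.eg.db::org.Hs.eg.db,  # assumes human annotation
    keyType      = "ENTREZID",
    ont          = "BP",
    pvalueCutoff = 0.05
  )
}
```

With a real results table you would call `run_gse_go(gene_list)` on the full ranked vector; padj is deliberately not used to filter the input.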

The default ranking metric in the Broad Institute's GSEA software is the so-called Signal2Noise metric, but other metrics can be used as well. FYI: since I use limma for my analyses, I routinely use its moderated t-statistics as the ranking metric. For more background / food-for-thought on this, see the GSEA website at the Broad Institute (https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideTEXT.htm#_Metrics_for_Ranking), or e.g. this paper.
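As a rough illustration of alternative ranking metrics, the sketch below computes a signed -log10 p-value score (the kind of metric mentioned in the question) and shows how moderated t-statistics could be pulled from a limma fit. The gene names, the values, and the helper names `signed_logp()` and `moderated_t_ranking()` are made up for the example; `fit` is assumed to be the output of limma's eBayes().

```r
# Hypothetical per-gene statistics for ALL tested genes.
lfc  <- c(geneA = 2.0,  geneB = -1.5, geneC = 0.1)
pval <- c(geneA = 1e-4, geneB = 1e-2, geneC = 0.5)

# Signed -log10(p): direction from the fold change, magnitude from the p-value.
signed_logp <- function(lfc, pval) {
  # Guard against p-values of exactly 0, which would give infinite scores.
  pval <- pmax(pval, .Machine$double.xmin)
  sign(lfc) * -log10(pval)
}

score <- sort(signed_logp(lfc, pval), decreasing = TRUE)

# Ranking by limma's moderated t-statistic; not called here because it
# needs a fitted model object from eBayes().
moderated_t_ranking <- function(fit, coef) {
  stopifnot(requireNamespace("limma", quietly = TRUE))
  tt <- limma::topTable(fit, coef = coef, number = Inf, sort.by = "none")
  sort(setNames(tt$t, rownames(tt)), decreasing = TRUE)
}
```

Whichever metric is chosen, it is computed for every tested gene and the sorted vector is passed on as the gene list.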

Also, to perform an ORA (based on Gene Ontology) in clusterProfiler you will need to use the function enrichGO().
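For contrast, an ORA run with enrichGO() could look roughly like the sketch below. Here the significance cutoff selects the query genes, and all tested genes serve as the background. The data frame `res`, its columns, and the wrapper `run_enrich_go()` are hypothetical, and human annotation with Entrez IDs is again an assumption.

```r
# Hypothetical differential-expression results for all tested genes.
res <- data.frame(
  entrez = c("7157", "672", "1956", "5290", "207"),
  padj   = c(0.001, 0.01, 0.6, NA, 0.8),
  stringsAsFactors = FALSE
)

# Query set: genes passing the significance cutoff.
sig_genes <- res$entrez[!is.na(res$padj) & res$padj < 0.05]

# Background: every gene that was tested, not the whole genome.
universe <- unique(res$entrez)

# Wrapper around enrichGO(); not called here because the toy gene set
# is too small for a meaningful test.
run_enrich_go <- function(sig_genes, universe) {
  stopifnot(requireNamespace("clusterProfiler", quietly = TRUE),
            requireNamespace("org.Hs.eg.db", quietly = TRUE))
  clusterProfiler::enrichGO(
    gene         = sig_genes,
    universe     = universe,
    OrgDb        = org.Hs.eg.db::org.Hs.eg.db,  # assumes human annotation
    keyType      = "ENTREZID",
    ont          = "BP",
    pvalueCutoff = 0.05
  )
}
```

Using all tested genes as the universe is one common choice for the background; the thread itself does not prescribe one.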