gseGO input list
leakerb • 0

Hi, sorry if this is an overly simple question, but I couldn't find a clear answer on the forums or in the vignette. I'm running gene set enrichment using the gseGO function in clusterProfiler. The function needs a list of genes, which I'm planning to rank by log fold change. Should the gene list contain all genes, or only genes below a significance cutoff (e.g. padj < 0.05)? I know some people also rank using something like (signed fold change * -log10 p-value). Should that metric use all genes or only those below a significance cutoff? If both inputs (all genes and padj < 0.05) are valid, under what circumstances should you use one over the other?

clusterProfiler • 227 views
Guido Hooiveld ★ 3.2k

For GSEA (a functional class scoring, FCS, method) you should use all genes, not a subset. If you use a subset, you are performing an over-representation analysis (ORA) instead. For more info on the differences between the methods (FCS vs ORA) you may want to check the links in this post: Cluster profiler - KEGG analysis
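As a rough illustration of what "all genes" means for the gseGO() input, here is a minimal sketch. The data frame `res`, its column names, and the helper names `make_ranked_list()` and `run_gsea_go()` are assumptions made up for this example; only `gseGO()` and its arguments come from clusterProfiler.

```r
# Toy differential expression table (hypothetical values and column names).
res <- data.frame(
  gene_id        = c("1001", "1002", "1003", "1004", "1005"),
  log2FoldChange = c(1.8, -2.3, 0.1, NA, -0.4),
  padj           = c(0.001, 0.0004, 0.9, NA, 0.6),
  stringsAsFactors = FALSE
)

# Build the ranked list from ALL tested genes: no significance filter.
# Only rows without a usable ranking value or with a duplicated ID are dropped.
make_ranked_list <- function(de, id_col = "gene_id", stat_col = "log2FoldChange") {
  de <- de[!is.na(de[[stat_col]]) & !duplicated(de[[id_col]]), ]
  ranked <- de[[stat_col]]
  names(ranked) <- de[[id_col]]
  # gseGO() expects a named numeric vector sorted in decreasing order.
  sort(ranked, decreasing = TRUE)
}

gene_list <- make_ranked_list(res)

# Thin wrapper around gseGO(); assumes clusterProfiler is installed and that
# the names of gene_list are Entrez IDs matching the supplied OrgDb object.
run_gsea_go <- function(gene_list, org_db) {
  clusterProfiler::gseGO(
    geneList = gene_list,
    OrgDb    = org_db,
    keyType  = "ENTREZID",
    ont      = "BP"
  )
}
```

With a real result table you would then call something like `run_gsea_go(gene_list, org.Hs.eg.db::org.Hs.eg.db)`; the five-gene toy list above is far too small to give meaningful output and is only there to show the expected structure.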

The default ranking metric in the Broad Institute's GSEA software is the so-called Signal2Noise metric, but other metrics can be used as well. FYI: since I use limma for my analyses, I routinely use its moderated t-values as the ranking metric. For more background / food-for-thought on this see the GSEA website at the Broad Institute (https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideTEXT.htm#_Metrics_for_Ranking), or e.g. this paper.
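To make the choice of ranking metric concrete, the sketch below builds two alternative ranked lists from a table shaped like limma's topTable() output (columns `logFC`, `t`, `P.Value`). The table `tt` and its values are invented for illustration, and the signed -log10 p-value metric is the one mentioned in the question, not a recommendation.

```r
# Toy table mimicking limma::topTable() columns (hypothetical values).
tt <- data.frame(
  gene_id = c("1001", "1002", "1003", "1004"),
  logFC   = c(1.8, -2.3, 0.1, -0.4),
  t       = c(6.2, -7.9, 0.3, -1.1),
  P.Value = c(1e-5, 1e-6, 0.77, 0.29),
  stringsAsFactors = FALSE
)

# Option 1: moderated t-statistic, used as-is for every gene.
rank_by_t <- sort(setNames(tt$t, tt$gene_id), decreasing = TRUE)

# Option 2: sign of the fold change times -log10(p-value).
# pmax() guards against p-values of exactly 0, which would give Inf.
signed_p  <- sign(tt$logFC) * -log10(pmax(tt$P.Value, .Machine$double.xmin))
rank_by_p <- sort(setNames(signed_p, tt$gene_id), decreasing = TRUE)
```

Either vector has the shape gseGO() expects for its `geneList` argument (named, numeric, decreasing). Both are computed over all genes, so weakly changed genes end up in the middle of the ranking rather than being removed.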

Also, to perform an ORA (based on Gene Ontology) in clusterProfiler you will need to use the function enrichGO().
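For contrast, an ORA run starts from a thresholded subset plus a background set. The sketch below assumes a hypothetical result table `res` with `gene_id` (Entrez IDs) and `padj` columns, and a made-up wrapper name `run_ora_go()`; `enrichGO()` and the arguments shown are from clusterProfiler.

```r
# Toy differential expression table (hypothetical values and column names).
res <- data.frame(
  gene_id = c("1001", "1002", "1003", "1004", "1005"),
  padj    = c(0.001, 0.0004, 0.9, NA, 0.6),
  stringsAsFactors = FALSE
)

# ORA input: the significant subset (here padj < 0.05, as in the question).
sig_genes <- res$gene_id[!is.na(res$padj) & res$padj < 0.05]

# Background: every gene that was actually tested, not just the significant ones.
universe <- res$gene_id[!is.na(res$padj)]

# Thin wrapper around enrichGO(); assumes clusterProfiler is installed and
# that the IDs are Entrez IDs matching the supplied OrgDb object.
run_ora_go <- function(sig_genes, universe, org_db) {
  clusterProfiler::enrichGO(
    gene     = sig_genes,
    universe = universe,
    OrgDb    = org_db,
    keyType  = "ENTREZID",
    ont      = "BP"
  )
}
```

Note that no ranking statistic enters this analysis at all: only membership in `sig_genes` versus `universe` matters, which is the practical difference from the gseGO() input.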
