gseGO input list
leakerb • 0

Hi, sorry if this is an overly simple question, but I couldn't find a clear answer on the forums or in the vignette. I'm running gene set enrichment using the gseGO function in clusterProfiler. The function needs a ranked list of genes, which I'm planning to rank by log fold change. Should the gene list contain all genes, or only genes below a significance cutoff (e.g. padj < 0.05)? I know some people also rank by something like (sign of fold change * -log10(p-value)). Should that metric be computed for all genes or only for those below a significance cutoff? If both inputs (all genes and padj < 0.05) are valid, under what circumstances should you use one over the other?

clusterProfiler • 3.1k views
Guido Hooiveld ★ 3.9k

For GSEA, which is a functional class scoring (FCS) method, you should use all genes, not a subset. If you use a subset, you are effectively performing an over-representation analysis (ORA). For more info on the differences between the methods (FCS vs ORA) you may want to check the links in this post: Cluster profiler - KEGG analysis

The default ranking metric in the Broad Institute's GSEA software is the so-called Signal2Noise metric, but other metrics can be used as well. FYI: since I use limma for my analyses, I routinely use its moderated t-values as ranking metric. For more background / food-for-thought on this see the GSEA website at the Broad Institute (https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideTEXT.htm#_Metrics_for_Ranking), or e.g. this paper.
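To make the "all genes, ranked" input concrete, here is a minimal sketch of how such a list could be built. The table `de`, its column names, and the wrapper `run_gsea()` are hypothetical and only serve as illustration; the metric is the signed -log10(p-value) mentioned in the question, not necessarily the best choice for your data. The wrapper assumes human gene symbols and the org.Hs.eg.db annotation package, and it is defined here but not called.

```r
# Hypothetical differential expression table containing ALL tested genes
# (toy values; in practice this comes from limma, DESeq2, edgeR, ...).
de <- data.frame(
  gene           = c("TP53", "MYC", "EGFR", "GAPDH", "ACTB", "BRCA1"),
  log2FoldChange = c(2.1, -1.4, 0.3, -0.05, 0.8, -2.6),
  pvalue         = c(1e-6, 1e-3, 0.40, 0.90, 0.04, 1e-8),
  stringsAsFactors = FALSE
)

# Ranking metric for every gene, with no significance filter applied.
# Note: a p-value of exactly 0 would give Inf and must be handled first.
metric <- sign(de$log2FoldChange) * -log10(de$pvalue)

# gseGO() expects a named numeric vector sorted in decreasing order.
gene_list <- sort(setNames(metric, de$gene), decreasing = TRUE)

# Hypothetical wrapper around gseGO(); assumes human SYMBOL identifiers.
# Not called here, because it needs clusterProfiler and org.Hs.eg.db
# and a realistically sized gene list.
run_gsea <- function(ranked_genes) {
  clusterProfiler::gseGO(
    geneList = ranked_genes,
    OrgDb    = org.Hs.eg.db::org.Hs.eg.db,
    keyType  = "SYMBOL",
    ont      = "BP"
  )
}
```

With a real dataset you would pass the full ranked vector, e.g. `run_gsea(gene_list)`. If you use limma, the moderated t-values (the `t` column of `topTable()`) could replace `metric` above.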

Also, to perform an ORA (based on Gene Ontology) in clusterProfiler you will need to use the function enrichGO().
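For contrast, a sketch of how the ORA input differs: here you do pass only the significant genes, plus (optionally) all tested genes as background. As before, the table `de`, the 0.05 cutoff, and the wrapper `run_ora()` are assumptions made for illustration; the wrapper assumes human gene symbols and org.Hs.eg.db, and it is defined but not called.

```r
# Hypothetical differential expression table containing all tested genes.
de <- data.frame(
  gene   = c("TP53", "MYC", "EGFR", "GAPDH", "ACTB", "BRCA1"),
  pvalue = c(1e-6, 1e-3, 0.40, 0.90, 0.04, 1e-8),
  stringsAsFactors = FALSE
)
de$padj <- p.adjust(de$pvalue, method = "BH")

# ORA works on an unranked subset: the genes passing the cutoff.
sig_genes <- de$gene[de$padj < 0.05]

# All tested genes form the background ("universe") for the test.
universe <- de$gene

# Hypothetical wrapper around enrichGO(); assumes human SYMBOL identifiers.
# Not called here, because it needs clusterProfiler and org.Hs.eg.db.
run_ora <- function(genes, background) {
  clusterProfiler::enrichGO(
    gene     = genes,
    universe = background,
    OrgDb    = org.Hs.eg.db::org.Hs.eg.db,
    keyType  = "SYMBOL",
    ont      = "BP"
  )
}
```

So the significance cutoff belongs to the ORA workflow (`run_ora(sig_genes, universe)`), while GSEA takes the complete ranked list.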


Very helpful, thank you!
