I am currently running a simple differential gene expression analysis to identify transcriptomic changes between two conditions in ~60 patients. The analysis revealed no significantly differentially expressed genes.
logFC are symmetrically distributed around 0 and range from
adj.P.val ranges from
1 (vast majority is
Nonetheless, I decided to run clusterProfiler::GSEA (using the hallmark gene sets from MSigDB). For that, I ranked all genes according to the logFC that I got from the differential gene expression analysis. To my surprise, there were plenty of significantly enriched terms (p.val<1e-16). I know this can happen and is even one of the "strengths" of GSEA, because it can detect changes at the pathway level even if the expression changes of individual genes are not significant, but I did not expect so many enriched pathways.
So I decided to run an additional test and ran the differential gene expression analysis between a completely random subset of individuals (disregarding the two initial conditions). The logFC distribution and adj.P.val distribution looked similar to those of the "real" differential gene expression analysis (comparing condition A and B). However, GSEA again showed many significantly enriched pathways. This makes me seriously doubt whether I can trust my GSEA results for the "real" differential gene expression analysis.
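To make the random control concrete, here is a minimal sketch of the procedure in base R. It is not my actual pipeline: the expression matrix is simulated, the random split is assumed to be two equally sized groups, a plain difference of group means stands in for the logFC from the differential expression model, and all object names (expr, random_group, ranked_genes, hallmark_sets) are hypothetical.

```r
set.seed(1)

# Placeholder data (assumption): 1000 genes x 60 samples of log-expression values.
n_genes <- 1000
n_samples <- 60
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("sample", seq_len(n_samples))))

# Randomly split the samples into two equally sized groups,
# ignoring the real conditions.
random_group <- sample(rep(c("A", "B"), length.out = n_samples))

# Per-gene logFC, approximated here as the difference in group means.
log_fc <- rowMeans(expr[, random_group == "B"]) -
  rowMeans(expr[, random_group == "A"])

# GSEA expects a named numeric vector sorted in decreasing order.
ranked_genes <- sort(log_fc, decreasing = TRUE)

# The ranked vector would then be passed on, e.g.
# clusterProfiler::GSEA(geneList = ranked_genes, TERM2GENE = hallmark_sets)
# where hallmark_sets is a hypothetical two-column (term, gene) data frame.
```

Repeating this with several different seeds would show how often the random splits produce "significant" pathways.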
I would greatly appreciate any insights you have on this topic and also maybe an explanation why this can happen and what I can do to prevent it.
Some more information to illustrate the issue:
Because basically all adjusted P values are 1, the volcano plot is pretty much meaningless.
Instead, I am showing the distribution of unadjusted P values:
This is what the top enriched pathway looks like. I used the