Hi,
I have RNA-seq data from two conditions and am trying to identify metabolic differences between them. I used DESeq2 to calculate log2 fold changes, thresholded them to obtain DEGs, and used these as input for GO enrichment and GSEA. However, few metabolic gene sets came out as statistically significant.
I reasoned that a self-contained test might reveal more differences than the competitive tests used by GO/GSEA, and so I decided to give fry a try. I re-processed the RNA-seq data starting from raw counts using the edgeR workflow, and then used fry. But out of ~11,000 gene sets that I tested, about 85% come out as statistically significant (FDR < 0.05). Is this reasonable/expected?
I have found a few papers reporting that self-contained tests can be too powerful, yielding many significant gene sets that are not always relevant to the biology of interest. What should I do when a competitive test yields few results and a self-contained test yields too many?
Thanks!
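For readers following the thread, here is a toy simulation of the self-contained vs. competitive contrast raised in the question. It is only a sketch under stated assumptions, not the poster's analysis and not an implementation of fry, GO enrichment, or GSEA: the gene-level z-statistics are simulated, 40% of genes are assumed to be DE, gene sets are drawn at random (so no set is truly enriched), genes are treated as independent, and both tests use simple normal approximations. All names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist

rng = random.Random(0)
N_GENES, N_SETS, SET_SIZE = 2000, 200, 50
DE_FRACTION, ALPHA = 0.4, 0.05  # assumed values, for illustration only

# Simulated per-gene z-statistics: non-DE genes ~ N(0, 1), DE genes ~ N(0, 3).
z = [rng.gauss(0, 3 if rng.random() < DE_FRACTION else 1) for _ in range(N_GENES)]
z2 = [v * v for v in z]  # squared statistics measure non-directional signal


def mean_var(xs):
    """Return the sample mean and unbiased sample variance of xs."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / (n - 1)


def upper_p(stat):
    """One-sided upper-tail p-value under a standard normal reference."""
    return 1 - NormalDist().cdf(stat)


def self_contained_p(idx):
    """H0: no gene in the set is DE, so each z^2 has mean 1 and variance 2."""
    m, _ = mean_var([z2[i] for i in idx])
    return upper_p((m - 1) / (2 / len(idx)) ** 0.5)


def competitive_p(idx):
    """H0: genes in the set are no more DE than genes outside it (Welch-type z)."""
    inside = set(idx)
    m_in, v_in = mean_var([z2[i] for i in inside])
    outside = [z2[i] for i in range(N_GENES) if i not in inside]
    m_out, v_out = mean_var(outside)
    se = (v_in / len(inside) + v_out / len(outside)) ** 0.5
    return upper_p((m_in - m_out) / se)


# Random gene sets: none is enriched for DE genes relative to the background.
gene_sets = [rng.sample(range(N_GENES), SET_SIZE) for _ in range(N_SETS)]
frac_self = sum(self_contained_p(s) < ALPHA for s in gene_sets) / N_SETS
frac_comp = sum(competitive_p(s) < ALPHA for s in gene_sets) / N_SETS

print(f"self-contained significant: {frac_self:.0%}")
print(f"competitive significant:    {frac_comp:.0%}")
```

In this toy setting, almost every random set contains some DE genes, so the self-contained test rejects for nearly all of them, while the competitive test rejects only at roughly its nominal rate. Whether this explains the 85% figure in the question depends on the actual data and code.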
Let's see your code. 8000 significant gene sets indicate a flaw somewhere.
Hi ATpoint, I added my code in the comment below. Would appreciate any pointers if you see any glaring errors!