GSEA preranked analysis downstream of DESeq2
EJ ▴ 10
@ej-11019
USA, Boston, Harvard Medical School

Hi,

I have generated differential expression results for my RNA-seq data using DESeq2. From what I have read, there are two ways of generating the GSEA preranked list: 1) by log2FC; 2) by p value.

Each metric suffers from certain shortcomings. For example, genes ranked by log2FC are biased by bigger variance in genes with low counts while genes ranked by p value are biased by genes with higher abundance and longer transcripts.

I have a thought: is it possible to weight log2FC by p value or padj and then generate the GSEA ranked list? For example, suppose genes A and B have the same log2FC but different p values, with A having the smaller one. We would then give gene A more weight than gene B based on their p values. Does this make sense, or is it statistically wrong? If it makes sense, what mathematical formula should be used to combine log2FC and p value?

Thank you!!!

deseq2 gsea rnaseq
@mikelove
United States

hi EJ,

"genes ranked by log2FC are biased by bigger variance in genes with low counts"

Note that this is not the case for DESeq2 log fold changes -- a unique property of our using Bayesian posterior estimates for LFC. See the DESeq2 paper or vignette, and examine an MA plot.

"while genes ranked by p value are biased by genes with higher abundance and longer transcripts"

For this consideration, you can use goseq following DESeq2. This method is specially designed to address this problem. There are a few posts on the support site on how to use goseq after DESeq2. I haven't had time to do any comparative analysis on the best methods for gene-set testing after DESeq2. I think goseq is the downstream method that I see most often used.

I do like the idea of methods that use the LFC or t-test, and aggregate across the genes in the set, which allows one to detect, at the level of gene set, when there is an abundance of marginal signal for each gene. I haven't had time to implement something for DESeq2 LFCs, although it's something I'm thinking of.
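To make the idea of aggregating LFCs across a gene set concrete, here is a rough Python sketch. It is not ROAST or CAMERA and not anything implemented in DESeq2: it simply compares the mean LFC of a set against the mean LFC of random gene sets of the same size. The function name, the toy LFC values, and the gene names are all made up for illustration, and gene-label permutation ignores inter-gene correlation, so treat the p-value as indicative only.

```python
import random

def gene_set_mean_shift(lfc_by_gene, gene_set, n_perm=2000, seed=0):
    """Mean LFC of a gene set and a gene-label permutation p-value (two-sided).

    Hypothetical helper for illustration; ignores inter-gene correlation.
    """
    rng = random.Random(seed)
    genes = list(lfc_by_gene)
    members = [g for g in genes if g in gene_set]
    k = len(members)
    observed = sum(lfc_by_gene[g] for g in members) / k
    hits = 0
    for _ in range(n_perm):
        # Draw a random gene set of the same size as the set under test.
        sample = rng.sample(genes, k)
        null_mean = sum(lfc_by_gene[g] for g in sample) / k
        if abs(null_mean) >= abs(observed):
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy data (made up): 180 background genes with LFC around 0,
# plus 20 set members that each carry a modest LFC of 0.4.
data_rng = random.Random(1)
lfc = {f"bg{i}": data_rng.gauss(0.0, 0.5) for i in range(180)}
pathway = {f"set{i}" for i in range(20)}
lfc.update({g: 0.4 for g in pathway})

observed, p_value = gene_set_mean_shift(lfc, pathway)
print(f"mean LFC in set = {observed:.2f}, permutation p = {p_value:.4f}")
```

No single gene in the toy set has a large LFC, but the set as a whole stands out, which is the kind of signal this style of method is meant to pick up.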

You might take a look also at the ROAST and CAMERA methods which are available in limma.


I'm running into the exact same question 5.7 years after the original post. Was wondering if there is an update on the advised practice for gene set enrichment downstream of DESeq2?

Thank you!!


I use goseq.


Would combining both into a single ranking metric, log2FC * -log10(p-value), overcome these shortcomings? Or would it introduce new ones?
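For anyone who wants to try it, a minimal Python sketch of the log2FC * -log10(p-value) metric follows. The function name, the `min_p` floor, and the example rows are assumptions for illustration, not part of DESeq2 or GSEA. The floor is there because a p-value that underflows to 0 would otherwise give an infinite score, and rows with a missing p-value are dropped rather than ranked.

```python
import math

def product_rank_metric(log2fc, pvalue, min_p=1e-300):
    """Return log2FC * -log10(p), or None when either input is missing."""
    if log2fc is None or pvalue is None:
        return None
    if math.isnan(log2fc) or math.isnan(pvalue):
        return None
    # Clamp p so that a p-value of exactly 0 does not produce infinity.
    p = max(pvalue, min_p)
    return log2fc * -math.log10(p)

# Hypothetical rows: (gene, log2FoldChange, pvalue)
results = [
    ("geneA", 2.0, 1e-10),
    ("geneB", 2.0, 0.01),
    ("geneC", -1.5, 1e-6),
    ("geneD", 0.3, None),  # missing p-value, excluded from the ranking
]

scored = [(gene, product_rank_metric(lfc, p)) for gene, lfc, p in results]
ranked = sorted(
    (row for row in scored if row[1] is not None),
    key=lambda row: row[1],
    reverse=True,
)
for gene, score in ranked:
    print(f"{gene}\t{score:.2f}")
```

geneA and geneB share the same log2FC, but geneA ranks higher because of its smaller p-value, which is the weighting the original question asked about.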


That's fine I guess. I don't know what the meaning of that term is, which is a downside in my opinion.

A posterior effect size is an estimate of an effect, which has a nice interpretation.


Regarding GSEA preranking metrics, what I’ve seen, including in this thread, is that many (or most) people use either logFC or sign(logFC) * -log10(pval). Both have disadvantages, because each captures only one of the two important aspects of DE analysis: the biological change in expression between conditions regardless of significance, or the significance of DE between conditions regardless of biological change. Each also has the biases mentioned in the OP. Multiplying the two terms, logFC * -log10(pval), seems to produce a better ranking metric than either term alone, because it takes both aspects into account.
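To see how the three metrics can disagree, here is a small Python sketch that ranks the same genes under each one. The gene names and values are invented for illustration only, and the helper names are hypothetical.

```python
import math

def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

# The three ranking metrics discussed in this thread.
METRICS = {
    "logFC": lambda lfc, p: lfc,
    "sign(logFC) * -log10(p)": lambda lfc, p: sign(lfc) * -math.log10(p),
    "logFC * -log10(p)": lambda lfc, p: lfc * -math.log10(p),
}

# Made-up toy table: gene -> (log2FoldChange, pvalue)
toy = {
    "big_effect_weak_evidence": (4.0, 0.04),
    "small_effect_strong_evidence": (0.5, 1e-12),
    "moderate_both": (2.0, 1e-5),
    "down_moderate": (-2.0, 1e-5),
}

def rank_genes(table, metric):
    """Sort gene names from most positive to most negative score."""
    return sorted(table, key=lambda g: metric(*table[g]), reverse=True)

for name, metric in METRICS.items():
    print(f"{name}: {rank_genes(toy, metric)}")
```

On this toy table each metric puts a different gene at the top: logFC favours the large but weakly supported change, the signed p-value favours the small but highly significant change, and the product favours the gene that is moderate on both. Whether that last ordering is actually better for GSEA is not something this sketch can show.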