Question

pre-ranked GSEA within R? + Best DESeq2/limma-voom metric?

Gordon Smyth 50k

@gordon-smyth

WEHI, Melbourne, Australia

Hi Jose,

Doing a one-sample t-test of the logFCs for a gene set is very similar to the test proposed by

http://www.ncbi.nlm.nih.gov/pubmed/20048385

(Actually that paper did a z-test of t-statistics, and criticized the test of logFCs as ignoring unequal variances between genes.)
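To make the two tests concrete, here is a minimal sketch in base R on simulated data. The gene set, the residual degrees of freedom, and the conversion of t-statistics to z-scores via `qnorm(pt(...))` are assumptions made for illustration; the exact statistic used in the cited paper may differ.

```r
# Simulated per-gene results; in practice these would come from a DE analysis.
set.seed(1)
n_genes  <- 2000
df_resid <- 10                          # assumed residual degrees of freedom
logFC    <- rnorm(n_genes, sd = 0.5)    # simulated log fold changes
t_stat   <- rt(n_genes, df = df_resid)  # simulated gene-wise t-statistics
in_set   <- seq_len(n_genes) <= 50      # hypothetical gene set: first 50 genes

# (a) One-sample t-test: is the mean logFC of the set different from 0?
tt <- t.test(logFC[in_set], mu = 0)

# (b) z-test of t-statistics: map each t to a z-score, then test the set sum.
#     Under independence, sum(z) / sqrt(m) is standard normal for a set of m genes.
z     <- qnorm(pt(t_stat, df = df_resid))
z_set <- sum(z[in_set]) / sqrt(sum(in_set))
p_z   <- 2 * pnorm(-abs(z_set))

cat("t-test p-value:", tt$p.value, "\n")
cat("z-test p-value:", p_z, "\n")
```

Version (a) treats every logFC as equally precise, whereas (b) works on statistics that are already scaled by each gene's own standard error.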

You're right that this is different from GSEA, and in fact it has been criticized directly by the GSEA authors themselves:

http://www.ncbi.nlm.nih.gov/pubmed/23070592

The point I was making in my previous email is that the criticism made by the GSEA authors of the logFC test applies equally to the GSEA pre-ranked test. So I suspect the GSEA authors would not recommend the pre-ranked test themselves unless there was no alternative (i.e., no replicates).
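A small simulation can illustrate why a test that treats genes as independent is risky, assuming the criticism concerns inter-gene correlation (which is how the follow-up in this thread reads it). The sketch below is base R only; the function `false_positive_rate`, the set size, the group size, and the correlation value are all hypothetical choices. There is no true differential expression, so every rejection is a false positive.

```r
set.seed(42)

# Fraction of null simulations in which a one-sample t-test on the logFCs
# of a gene set rejects at the 5% level, for a given inter-gene correlation.
false_positive_rate <- function(rho, n_sims = 500, m = 30, n_per_group = 5) {
  n     <- 2 * n_per_group
  group <- rep(c(0, 1), each = n_per_group)
  pvals <- replicate(n_sims, {
    shared <- rnorm(n)  # per-sample factor shared by all genes in the set
    # Each gene has unit variance and pairwise correlation rho with the others.
    expr <- sqrt(rho) * matrix(shared, m, n, byrow = TRUE) +
            sqrt(1 - rho) * matrix(rnorm(m * n), m, n)
    logFC <- rowMeans(expr[, group == 1]) - rowMeans(expr[, group == 0])
    t.test(logFC)$p.value
  })
  mean(pvals < 0.05)
}

fpr_indep <- false_positive_rate(rho = 0)    # independent genes
fpr_corr  <- false_positive_rate(rho = 0.3)  # correlated genes

cat("False positive rate, rho = 0:  ", fpr_indep, "\n")
cat("False positive rate, rho = 0.3:", fpr_corr, "\n")
```

With independent genes the rejection rate should sit near the nominal level, while with correlated genes it rises well above it, because the shared component shifts all logFCs in the set together and the t-test does not account for that.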

Best wishes
Gordon

--------- original message ------------
[BioC] pre-ranked GSEA within R? + Best DESeq2/limma-voom metric?
Garcia Manteiga Jose Manuel <garciamanteiga.josemanuel at hsr.it>
Fri Jan 17 17:52:01 CET 2014

Dear Mike,

Thanks for the confirmation, I remember talking to someone during the Bioc2013 lab saying that same thing on shrunken log2FC but I do not know why I thought that the value to use with pre-ranked was not the log2FC in the results object but another one due to the shrinking procedure.

On the other hand, on the workflow from the lab, I remember that in 5.1 log2FC was used to test whether particular gene sets had an average log2FC different from 0. That would use log2FC information but in a different way compared to how I understand GSEA pre-ranked analysis (the procedure published by the people in Broad). In the Broad pre-ranked strategy, the enrichment using log2FCs from the top (positive and negative) is giving you an enrichment score that should be more related, in my view, to biological relevance since it is linked to highest log2FCs observed, in absolute values, whereas the workflow in 5.1 would flag those gene sets that do show a different behaviour but maybe very low just because we are mathematically able to say so (because of the t-test used).

By the way, could the thresholded test implementation in DESeq2 >1.3, or something similar, be used in a similar way to filter for those gene sets with a more relevant enrichment using the 5.1 workflow?

Thanks again for the efforts

Jose

On 17 Jan 2014, at 16:10, Michael Love <michaelisaiahlove at gmail.com> wrote:

hi Garcia,

For DESeq2, we do recommend using shrunken log fold changes for gene set tests, that is, the log2FoldChange column of a results object when the default DESeq2 pipeline has been used.

For a suggested workflow, see section 5.1 of the PDF for the presentation "DESeq2/DEXSeq / Alejandro Reyes, Wolfgang Huber", one of the Bioc2013 labs: http://bioconductor.org/help/course-materials/2013/BioC2013/

Mike

On Fri, Jan 17, 2014 at 3:30 AM, Garcia Manteiga Jose Manuel <garciamanteiga.josemanuel at hsr.it> wrote:

Hi,

I wanted the same and this is what gsea people replied to me some months ago:

> Hi,
> Thank you for your interest in GSEA.
> R GSEA does accept ranked list as an input.
> Please note that R GSEA has not been actively maintained since 2005.
> Regards,

I have not gone through the R GSEA again to check how pre-ranked lists can be accepted, but the last sentence in their message made me hold a minute and think in other possibilities. For the moment I am still doing as you mentioned, using the rnk file with java, then getting back again from results of gsea to R.

By the way, I would like to known which metric should be best to be used for that kind of analysis when using RNA-Seq data coming from DESeq2 analysis. log2FC? p-values? Are they considered to be weighted, as GSEA pre-ranked names them?

Thanks
Regards

On 17 Jan 2014, at 00:18, Daniel Schmolze <bioconductor at schmolze.com> wrote:

I want to do a GSEA entirely from within R, using genes ranked by my own metric. At the moment I'm saving my ranked genes to a .rnk file, then calling Broad's GSEA java program, then reading the resulting output back into R (all I care about are the p-values in gsea_report_for_na_pos_####.xls and gsea_report_for_na_neg_####.xls). Cumbersome to say the least.

As far as I can tell, the Broad GSEA R script won't accept pre-ranked genes, but maybe I'm wrong? If not, I'm interested in other options. I'd like to specifically stick with the Broad GSEA algorithm if possible.

DESeq2 • 3.9k views

Gordon Smyth · Answer 1 · 2014-01-20

Dear Gordon,

First of all, thank you very much for your replies to our questions. For a non-statistician it is always a pleasure (and a MUST) to bear in mind the actual statistics behind the approaches, explained in a clear way showing their advantages, what they do, what they do not/cannot do, caveats and comparisons.

Let me see if I got the point in my own words:

1. The (shrunken) log2FC would be the metric to use with DESeq2 and other RNA-Seq DE tools for pre-ranked Gene Set Enrichment Analysis, but:
2. Inter-gene correlation will inflate significance if GSEA (Broad) is used with those log2FCs.
3. This issue has been corrected in the CAMERA approach, but it works only for normal limma-voom data. (Your next publication in Feb 2014 will show how limma-voom can be used over DESeq (DESeq2?) and others with success.)
4. CAMERA can calculate the enrichment for any gene set in MSigDB against the background of other sets, not against all the others, "one by one", taking account of the inter-correlations.

So, the solutions I see would be:

1. Use limma/voom/CAMERA/MSigDB for all the collections of sets I am interested in (C2, C6, ...), for all the gene sets individually, and get their enrichment p-values (inter-correlation OK). Should this give me a list of top enriched gene sets as GSEA (Broad) does, but with the correction for inter-correlation? I will explore the CAMERA function promptly.
2. Use DESeq2 shrunken fold changes and find a way to feed them into CAMERA or a CAMERA-like approach.

In the end, I am not a supporter of either approach, DESeq2 over limma-voom or GSEA over CAMERA. What I would like is a statistically correct approach that takes the DE log2FCs with their p-values and uses lists of relevant genes, taking account of the size of change (~biological relevance) in the form of a ranked list, average fold changes or other, to calculate an enrichment. I chose the pre-ranked GSEA because of the problem of having to define cutoffs of significance, especially for fold changes.

By the way, I was thinking of using another statistic to explore RNA-Seq results in a different way from classical DE, which might prove tricky to use with gene set enrichment bearing in mind the inter-correlation problem. I am thinking about computing the Pearson correlation coefficient of each gene in my dataset, using expression values in all samples, against the values of a set of genes of my choice (also in my dataset), and finding whether the most correlated genes (again, without a cutoff on Pearson's r would be better, like ranking from 1 to 0 or -1) are enriched in a selected pathway or category. What should I use in this case for gene set enrichment analysis?

Reading the discussion in your CAMERA paper, I realise that the inter-correlation problem arises because, using DE data, we should underweight those gene sets with correlation, given your view (in my opinion correct in most yet not all cases) that inter-gene correlation 'reflects non-specific co-regulation, unrelated to the treatment condition'. That may hold true for most DE-driven GSEA tests. What about the test I last mentioned, where I am really interested in genes correlating to a set of genes across my whole dataset, regardless of conditions (say, all targets of the same miRNA)?

And last, what if the set of genes used to compute Pearson's r were defined based on a classical DE analysis of the same dataset, say with limma-voom?

Thanks again for your help

Best wishes
Jose
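For the limma/CAMERA route discussed in this reply, a minimal sketch of a `camera()` call is shown here, assuming the limma package is installed. The data are simulated, the gene sets `set_up` and `set_null` are hypothetical, and with real RNA-seq counts the matrix `y` would typically be replaced by the output of `voom()` and the index list built from MSigDB collections.

```r
library(limma)

set.seed(7)
n_genes     <- 1000
n_per_group <- 4

# Simulated log-expression matrix (genes x samples); illustrative data only.
y <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)
rownames(y) <- paste0("gene", seq_len(n_genes))

group  <- factor(rep(c("ctrl", "trt"), each = n_per_group))
design <- model.matrix(~ group)

# Hypothetical gene sets given as row indices into y.
index <- list(set_up = 1:40, set_null = 41:80)

# Give the genes in set_up a real shift in the treated samples.
y[index$set_up, group == "trt"] <- y[index$set_up, group == "trt"] + 1

# Competitive gene set test for the treatment coefficient (column 2 of design),
# allowing for inter-gene correlation.
res <- camera(y, index, design, contrast = 2)
print(res)
```

Each row of the result corresponds to one gene set, with its size, direction of change and p-value, so the table can be sorted to obtain a list of top gene sets.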