Question: pre-ranked GSEA within R? + Best DESeq2/limma-voom metric?
1
gravatar for Gordon Smyth
5.5 years ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:
Hi Jose, Doing a one-sample t-test of the logFCs for a gene set is very similar to the test proposed by http://www.ncbi.nlm.nih.gov/pubmed/20048385 (Actually that paper did a z-test of t-statistics, and criticized the test of logFCs as ignoring unequal variances betweeen genes.] You're right that this is different from GSEA, and in fact it has been criticized directly by the GSEA authors themselves: http://www.ncbi.nlm.nih.gov/pubmed/23070592 The point I was making in my previous email, is that the criticism made by the GSEA authors of the logFC test applies equally to the GSEA pre- ranked test. So I suspect the GSEA authors would not recommend the pre- ranked test themselves unless there was no alternative (i.e., no replicates). Best wishes Gordon --------- original message ------------ [BioC] pre-ranked GSEA within R? + Best DESeq2/limma-voom metric? Garcia Manteiga Jose Manuel <garciamanteiga.josemanuel at="" hsr.it=""> Fri Jan 17 17:52:01 CET 2014 Dear Mike, Thanks for the confirmation, I remember talking to someone during the Bioc2013 lab saying that same thing on shrunken log2FC but I do not know why I thought that the value to use with pre-ranked was not the log2FC in the results object but another one due to the shrinking procedure. On the other hand, on the workflow from the lab, I remember that in 5.1 log2FC was used to test whether particular gene sets had an average log2FC different from 0. That would use log2FC information but in a different way compared to how I understand GSEA pre-ranked analysis (the procedure published by the people in Broad). In the Broad pre-ranked strategy, the enrichment using log2FCs from the top (positive and negative) is giving you an enrichment score that should be more related, in my view, to biological relevance since it is linked to highest log2FCs observed, in absolute values, whereas the workflow in 5.1 would flag those gene sets that do show a different behaviour but maybe very low just because we are mathematically able to say so (because of the t-test used). By the way, could the thresholded test implementation in DESeq2 >1.3, or something similar, be used in a similar way to filter for those gene sets with a more relevant enrichment using the 5.1 workflow? Thanks again for the efforts Jose On 17 Jan 2014, at 16:10, Michael Love <michaelisaiahlove at="" gmail.com=""> wrote: hi Garcia, For DESeq2, we do recommend using shrunken log fold changes for gene set tests, that is, the log2FoldChange column of a results object when the default DESeq2 pipeline has been used. For a suggested workflow, see section 5.1 of the PDF for the presentation "DESeq2/DEXSeq / Alejandro Reyes, Wolfgang Huber", one of the Bioc2013 labs: http://bioconductor.org/help/course-materials/2013/BioC2013/ Mike On Fri, Jan 17, 2014 at 3:30 AM, Garcia Manteiga Jose Manuel <garciamanteiga.josemanuel at="" hsr.it=""> wrote: Hi, I wanted the same and this is what gsea people replied to me some months ago: Hi, Thank you for your interest in GSEA. R GSEA does accept ranked list as an input. Please note that R GSEA has not been actively maintained since 2005. Regards, I have not gone through the R GSEA again to check how pre-ranked lists can be accepted, but the last sentence in their message made me hold a minute and think in other possibilities. For the moment I am still doing as you mentioned, using the rnk file with java, then getting back again from results of gsea to R. By the way, I would like to known which metric should be best to be used for that kind of analysis when using RNA-Seq data coming from DESeq2 analysis. log2FC? p-values? Are they considered to be weighted, as GSEA pre-ranked names them? Thanks Regards On 17 Jan 2014, at 00:18, Daniel Schmolze <bioconductor at="" schmolze.com=""> wrote: I want to do a GSEA entirely from within R, using genes ranked by my own metric. At the moment I'm saving my ranked genes to a .rnk file, then calling Broad's GSEA java program, then reading the resulting output back into R (all I care about are the p-values in gsea_report_for_na_pos_####.xls and gsea_report_for_na_neg_####.xls). Cumbersome to say the least. As far as I can tell, the Broad GSEA R script won't accept pre-ranked genes, but maybe I'm wrong? If not, I'm interested in other options. I'd like to specifically stick with the Broad GSEA algorithm if possible. ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
deseq2 • 2.3k views
ADD COMMENTlink modified 5.5 years ago by Jose M Garcia Manteiga310 • written 5.5 years ago by Gordon Smyth38k
Answer: pre-ranked GSEA within R? + Best DESeq2/limma-voom metric?
0
gravatar for Jose M Garcia Manteiga
5.5 years ago by
Italy
Jose M Garcia Manteiga310 wrote:
Dear Gordon, First of all, thank you very much for your replies to our questions. For a non-statistician it is always a pleasure (and a MUST) bearing in mind the actual statistics behind the approaches, explained in a clear way showing its advantages, what they do, what they do not/cannot do, caveats and comparisons. Let me see if I got the point in my words: 1. log2FC (shrunken) would be the one to be used for DESeq2 and other RNA-Seq DE tools for Gene Set Enrichment Analysis pre-ranked but: 2. intercorrelation of genes will inflate p-values if GSEA (broad) is used with those log2FC. 3. This issue has been corrected in CAMERA approach but it works only for normal-limma voom data. (Your next publication in Feb 2014 will show how limma-voom can be used over DESeq (DESeq2?) and others with success. 4. CAMERA can calculate the enrichment for any Gene Set in MSigDB against the background of other sets, not against all the others, "one by one", taking account of the inter-correlations. So, the solutions I see would be: 1. Use limma/voom/Camera/MSigDB for all the collections of sets I am interested in (C2,C6,..) for all the gene sets individually and get their p-values of enrichment (intercorrelation OK). Should this work in giving me a list of top Enriched Gene Sets as GSEA(broad) does but with the correction for intercorrelation? I will explore CAMERA function promptly. 2. Use DESeq2/shrunken foldChanges/ find a way to use them into Camera or Camera-like approach. In the end, I am not a supporter or neither approach, DESeq2 over limma-voom or GSEA over Camera. What I would like is to have a statistically correct approach that takes the DE log2FC with their p-values and uses lists of relevant genes taking account of the size of change(~biological relevance) (in the form of a rank list or average FoldChanges or other) to calculate an enrichment. I chose the pre- ranked GSEA because of the problem of having to define cutoffs of significance, specially for fold changes. By the way, I was thinking in using other statistic to explore RNA-Seq results in a different way to classical DE that might result tricky to use with Gene Set Enrichment bearing in mind the intercorrelations problem. I am thinking about finding the correlation Pearson coefficient of each gene in my dataset for expression values in all samples to values of a set of genes of my choice (also in my dataset )and find whether most correlated genes (again without a cut off of R pearson would be better, like ranking from 1 to 0 or -1) are enriched in a selected pathway or category. What should I use in this case for gene set enrichment analysis? Reading your CAMERA paper discussion, I realise that the intercorrelation problem is so because using DE data we should underweight those gene sets with correlation because of your view, in my opinion correct in most yet not all cases, that inter-gene correlation 'reflects non-specific co-regulation, unrelated to the treatment condition'. That may hold true for most DE-driven GSEA tests. What about the test I last mentioned, where I am really interested in genes correlating to a set of genes across my whole dataset, regardless conditions (say all targets of the same miRNA)? And last, what if my set of genes used to find R pearson were defined based on a classical DE analysis of the same dataset, say with limma-voom? Thanks again for your help Best wishes Jose 2014/1/19 Gordon K Smyth <smyth@wehi.edu.au> > Hi Jose, > > Doing a one-sample t-test of the logFCs for a gene set is very similar to > the test proposed by > > http://www.ncbi.nlm.nih.gov/pubmed/20048385 > > (Actually that paper did a z-test of t-statistics, and criticized the test > of logFCs as ignoring unequal variances betweeen genes.] > > You're right that this is different from GSEA, and in fact it has been > criticized directly by the GSEA authors themselves: > > http://www.ncbi.nlm.nih.gov/pubmed/23070592 > > The point I was making in my previous email, is that the criticism made by > the GSEA authors of the logFC test applies equally to the GSEA pre- ranked > test. So I suspect the GSEA authors would not recommend the pre- ranked > test themselves unless there was no alternative (i.e., no replicates). > > Best wishes > Gordon > > > --------- original message ------------ > [BioC] pre-ranked GSEA within R? + Best DESeq2/limma-voom metric? > Garcia Manteiga Jose Manuel <garciamanteiga.josemanuel@hsr.it> > Fri Jan 17 17:52:01 CET 2014 > > Dear Mike, > Thanks for the confirmation, I remember talking to someone during the > Bioc2013 lab saying that same thing on shrunken log2FC but I do not know > why I thought that the value to use with pre-ranked was not the log2FC in > the results object but another one due to the shrinking procedure. > > On the other hand, on the workflow from the lab, I remember that in 5.1 > log2FC was used to test whether particular gene sets had an average log2FC > different from 0. That would use log2FC information but in a different way > compared to how I understand GSEA pre-ranked analysis (the procedure > published by the people in Broad). In the Broad pre-ranked strategy, the > enrichment using log2FCs from the top (positive and negative) is giving > you an enrichment score that should be more related, in my view, to > biological relevance since it is linked to highest log2FCs observed, in > absolute values, whereas the workflow in 5.1 would flag those gene sets > that do show a different behaviour but maybe very low just because we are > mathematically able to say so (because of the t-test used). > > By the way, could the thresholded test implementation in DESeq2 >1.3, or > something similar, be used in a similar way to filter for those gene sets > with a more relevant enrichment using the 5.1 workflow? > Thanks again for the efforts > > Jose > > > > On 17 Jan 2014, at 16:10, Michael Love <michaelisaiahlove@gmail.com> > wrote: > > hi Garcia, > > For DESeq2, we do recommend using shrunken log fold changes for gene set > tests, that is, the log2FoldChange column of a results object when the > default DESeq2 pipeline has been used. > > For a suggested workflow, see section 5.1 of the PDF for the presentation > "DESeq2/DEXSeq / Alejandro Reyes, Wolfgang Huber", one of the Bioc2013 > labs: http://bioconductor.org/help/course-materials/2013/BioC2013/ > > Mike > > > > On Fri, Jan 17, 2014 at 3:30 AM, Garcia Manteiga Jose Manuel > <garciamanteiga.josemanuel@hsr.it> wrote: > Hi, > I wanted the same and this is what gsea people replied to me some months > ago: > > > Hi, > > Thank you for your interest in GSEA. > > R GSEA does accept ranked list as an input. > > Please note that R GSEA has not been actively maintained since 2005. > > Regards, > > > I have not gone through the R GSEA again to check how pre-ranked lists can > be accepted, but the last sentence in their message made me hold a minute > and think in other possibilities. > For the moment I am still doing as you mentioned, using the rnk file with > java, then getting back again from results of gsea to R. > > By the way, I would like to known which metric should be best to be used > for that kind of analysis when using RNA-Seq data coming from DESeq2 > analysis. log2FC? p-values? Are they considered to be weighted, as GSEA > pre-ranked names them? > Thanks > Regards > > > > On 17 Jan 2014, at 00:18, Daniel Schmolze <bioconductor@schmolze.com> > wrote: > > I want to do a GSEA entirely from within R, using genes ranked by my > own metric. At the moment I'm saving my ranked genes to a .rnk file, > then calling Broad's GSEA java program, then reading the resulting > output back into R (all I care about are the p-values in > gsea_report_for_na_pos_####.xls and gsea_report_for_na_neg_####.xls). > Cumbersome to say the least. > > As far as I can tell, the Broad GSEA R script won't accept pre- ranked > genes, but maybe I'm wrong? If not, I'm interested in other options. > I'd like to specifically stick with the Broad GSEA algorithm if > possible. > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:23}}
ADD COMMENTlink written 5.5 years ago by Jose M Garcia Manteiga310
Hi Jose, This is a big topic, and you raise a lot of issues. I'll just try to make some brief responses. On Mon, 20 Jan 2014, Jose Garcia wrote: > Dear Gordon, > First of all, thank you very much for your replies to our questions. For a > non-statistician it is always a pleasure (and a MUST) bearing in mind the > actual statistics behind the approaches, explained in a clear way showing > its advantages, what they do, what they do not/cannot do, caveats and > comparisons. > > Let me see if I got the point in my words: > > 1. log2FC (shrunken) would be the one to be used for DESeq2 and other > RNA-Seq DE tools for Gene Set Enrichment Analysis pre-ranked but: > 2. intercorrelation of genes will inflate p-values if GSEA (broad) is used > with those log2FC. Yes, a pre-ranked GSEA analysis will inflate significant (make p-values smaller). You should only do a pre-ranked analysis if you have no alternative, for example if you have no replicates. If you have no replicates, then as far as I know the only shrunk logFC available to you is that produced by predFC() in the edgeR package. Note that a standard GSEA analysis, as described in the GSEA publications, requires lots of replicates and does not inflate significance. > 3. This issue has been corrected in CAMERA approach but it works only > for normal-limma voom data. (Your next publication in Feb 2014 will show > how limma-voom can be used over DESeq (DESeq2?) and others with success. At the time we wrote the voom paper, CAMERA and ROAST worked with voom but not with edgeR. CAMERA and ROAST now work with edgeR as well. For help see library(edgeR) ?camera.DGEList > 4. CAMERA can calculate the enrichment for any Gene Set in MSigDB > against the background of other sets, not against all the others, "one > by one", taking account of the inter-correlations. I'm not quite sure what you mean. CAMERA does take account of inter-correlations, as you say. It can test against the background of all other sets or against the background of all genes/probes in the genome. By default the background is all genes in the complete data but not in the set. > So, the solutions I see would be: > > 1. Use limma/voom/Camera/MSigDB for all the collections of sets I am > interested in (C2,C6,..) for all the gene sets individually and get their > p-values of enrichment (intercorrelation OK). Should this work in giving me > a list of top Enriched Gene Sets as GSEA(broad) does but with the > correction for intercorrelation? I will explore CAMERA function promptly. Yes. Note that a regular GSEA analysis (not pre-ranked) also corrects for inter-gene correlations, but requires more replicates than CAMERA or ROAST do. > 2. Use DESeq2/shrunken foldChanges/ find a way to use them into Camera or > Camera-like approach. No, this would not make sense. > In the end, I am not a supporter or neither approach, DESeq2 over > limma-voom or GSEA over Camera. What I would like is to have a > statistically correct approach that takes the DE log2FC with their > p-values and uses lists of relevant genes taking account of the size of > change(~biological relevance) (in the form of a rank list or average > FoldChanges or other) to calculate an enrichment. I chose the pre- ranked > GSEA because of the problem of having to define cutoffs of significance, > specially for fold changes. None of the gene set testing methods require cutoffs. This is one of things that distinguish them from overlap tests (like an standard GO analysis with Fisher tests). > By the way, I was thinking in using other statistic to explore RNA- Seq > results in a different way to classical DE that might result tricky to use > with Gene Set Enrichment bearing in mind the intercorrelations problem. I > am thinking about finding the correlation Pearson coefficient of each gene > in my dataset for expression values in all samples to values of a set of > genes of my choice (also in my dataset )and find whether most correlated > genes (again without a cut off of R pearson would be better, like ranking > from 1 to 0 or -1) are enriched in a selected pathway or category. What > should I use in this case for gene set enrichment analysis? Well, I have done almost the same analysis that you suggest in the past: http://www.biomedcentral.com/1471-2164/9/363 In that paper, I ranked genes by their correlation with RUNX1 and then tested for enrichment of various sets of genes using the geneSetTest() function of the limma package. This is similar to a GSEA pre-ranked test but rather seemed to me to be simpler and more convenient. Note that Pearson correlation is not ideal. A better approach would be to regress all other genes on your gene of interest. You can do this by adding the log-expression values (logCPM say) of the gene of interest as a column to the design matrix when running voom or edgeR. We this approach, you can then used the ROAST or CAMERA functions to test for pathways that are correlated with your gene of interest while accounting for inter- gene correlations. > Reading your CAMERA paper discussion, I realise that the > intercorrelation problem is so because using DE data we should > underweight those gene sets with correlation because of your view, in my > opinion correct in most yet not all cases, that inter-gene correlation > 'reflects non-specific co-regulation, unrelated to the treatment > condition'. That may hold true for most DE-driven GSEA tests. In the CAMERA method, we only use the non-specific inter-gene correlation that is not associated with the treatment condition. So we are making an observation rather than an assumption. > What about the test I last mentioned, where I am really interested in > genes correlating to a set of genes across my whole dataset, regardless > conditions (say all targets of the same miRNA)? And last, what if my set > of genes used to find R pearson were defined based on a classical DE > analysis of the same dataset, say with limma-voom? Sounds like you would be double-dipping. Best wishes Gordon > Thanks again for your help > Best wishes > Jose > > > > 2014/1/19 Gordon K Smyth <smyth at="" wehi.edu.au=""> > >> Hi Jose, >> >> Doing a one-sample t-test of the logFCs for a gene set is very similar to >> the test proposed by >> >> http://www.ncbi.nlm.nih.gov/pubmed/20048385 >> >> (Actually that paper did a z-test of t-statistics, and criticized the test >> of logFCs as ignoring unequal variances betweeen genes.] >> >> You're right that this is different from GSEA, and in fact it has been >> criticized directly by the GSEA authors themselves: >> >> http://www.ncbi.nlm.nih.gov/pubmed/23070592 >> >> The point I was making in my previous email, is that the criticism made by >> the GSEA authors of the logFC test applies equally to the GSEA pre- ranked >> test. So I suspect the GSEA authors would not recommend the pre- ranked >> test themselves unless there was no alternative (i.e., no replicates). >> >> Best wishes >> Gordon >> >> >> --------- original message ------------ >> [BioC] pre-ranked GSEA within R? + Best DESeq2/limma-voom metric? >> Garcia Manteiga Jose Manuel <garciamanteiga.josemanuel at="" hsr.it=""> >> Fri Jan 17 17:52:01 CET 2014 >> >> Dear Mike, >> Thanks for the confirmation, I remember talking to someone during the >> Bioc2013 lab saying that same thing on shrunken log2FC but I do not know >> why I thought that the value to use with pre-ranked was not the log2FC in >> the results object but another one due to the shrinking procedure. >> >> On the other hand, on the workflow from the lab, I remember that in 5.1 >> log2FC was used to test whether particular gene sets had an average log2FC >> different from 0. That would use log2FC information but in a different way >> compared to how I understand GSEA pre-ranked analysis (the procedure >> published by the people in Broad). In the Broad pre-ranked strategy, the >> enrichment using log2FCs from the top (positive and negative) is giving >> you an enrichment score that should be more related, in my view, to >> biological relevance since it is linked to highest log2FCs observed, in >> absolute values, whereas the workflow in 5.1 would flag those gene sets >> that do show a different behaviour but maybe very low just because we are >> mathematically able to say so (because of the t-test used). >> >> By the way, could the thresholded test implementation in DESeq2 >1.3, or >> something similar, be used in a similar way to filter for those gene sets >> with a more relevant enrichment using the 5.1 workflow? >> Thanks again for the efforts >> >> Jose >> >> >> >> On 17 Jan 2014, at 16:10, Michael Love <michaelisaiahlove at="" gmail.com=""> >> wrote: >> >> hi Garcia, >> >> For DESeq2, we do recommend using shrunken log fold changes for gene set >> tests, that is, the log2FoldChange column of a results object when the >> default DESeq2 pipeline has been used. >> >> For a suggested workflow, see section 5.1 of the PDF for the presentation >> "DESeq2/DEXSeq / Alejandro Reyes, Wolfgang Huber", one of the Bioc2013 >> labs: http://bioconductor.org/help/course-materials/2013/BioC2013/ >> >> Mike >> >> >> >> On Fri, Jan 17, 2014 at 3:30 AM, Garcia Manteiga Jose Manuel >> <garciamanteiga.josemanuel at="" hsr.it=""> wrote: >> Hi, >> I wanted the same and this is what gsea people replied to me some months >> ago: >> >> >> Hi, >> >> Thank you for your interest in GSEA. >> >> R GSEA does accept ranked list as an input. >> >> Please note that R GSEA has not been actively maintained since 2005. >> >> Regards, >> >> >> I have not gone through the R GSEA again to check how pre-ranked lists can >> be accepted, but the last sentence in their message made me hold a minute >> and think in other possibilities. >> For the moment I am still doing as you mentioned, using the rnk file with >> java, then getting back again from results of gsea to R. >> >> By the way, I would like to known which metric should be best to be used >> for that kind of analysis when using RNA-Seq data coming from DESeq2 >> analysis. log2FC? p-values? Are they considered to be weighted, as GSEA >> pre-ranked names them? >> Thanks >> Regards >> >> >> >> On 17 Jan 2014, at 00:18, Daniel Schmolze <bioconductor at="" schmolze.com=""> >> wrote: >> >> I want to do a GSEA entirely from within R, using genes ranked by my >> own metric. At the moment I'm saving my ranked genes to a .rnk file, >> then calling Broad's GSEA java program, then reading the resulting >> output back into R (all I care about are the p-values in >> gsea_report_for_na_pos_####.xls and gsea_report_for_na_neg_####.xls). >> Cumbersome to say the least. >> >> As far as I can tell, the Broad GSEA R script won't accept pre- ranked >> genes, but maybe I'm wrong? If not, I'm interested in other options. >> I'd like to specifically stick with the Broad GSEA algorithm if >> possible. >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the >> addressee. >> You must not disclose, forward, print or use it without the permission of >> the sender. >> ______________________________________________________________________ >> > > > > -- > Jose M. Garcia Manteiga PhD > Research Assistant - Data Analysis in Functional Genomics > Center for Translational Genomics and BioInformatics > Dibit2-Basilica, 3A3 > San Raffaele Scientific Institute > Via Olgettina 58, 20132 Milano (MI), Italy > > Tel: +39-02-2643-4751 > e-mail: garciamanteiga.josemanuel at hsr.it > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD REPLYlink written 5.5 years ago by Gordon Smyth38k

Dear Gordon,

Thanks for the reply. I will try the regression using logCPM values. If I have more than one gene to regress to, I guess I could use some design formula that could try to regress to all of them, with a the highest coefficient. What could be such formula?

For R pearson I used the Averaged R coefficient and applied a Rank Product approach on them.

With the regression approach, In the end I would finish with a limma-coef (<0 anticorrealtion >0 correlation with a p.value) that I could use for geneSetTest.

Thank you very much for the suggestion.
Best
Jose

 

ADD REPLYlink modified 14 months ago by Gordon Smyth38k • written 5.5 years ago by Jose M Garcia Manteiga310

If just two or three genes to regress on, then you could possibly put all of them in the design matrix.  Otherwise I would consider using the first principle component of the log expression values.

Gordon

ADD REPLYlink modified 14 months ago • written 5.5 years ago by Gordon Smyth38k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 205 users visited in the last hour