pre-ranked GSEA within R?

4

Entering edit mode

Daniel Schmolze ▴ 50

@daniel-schmolze-6337

Last seen 9.7 years ago

I want to do a GSEA entirely from within R, using genes ranked by my own metric. At the moment I'm saving my ranked genes to a .rnk file, then calling Broad's GSEA java program, then reading the resulting output back into R (all I care about are the p-values in gsea_report_for_na_pos_####.xls and gsea_report_for_na_neg_####.xls). Cumbersome to say the least. As far as I can tell, the Broad GSEA R script won't accept pre-ranked genes, but maybe I'm wrong? If not, I'm interested in other options. I'd like to specifically stick with the Broad GSEA algorithm if possible.

• 12k views

ADD COMMENT • link updated 7.7 years ago by alserg ▴ 260 • written 10.3 years ago by Daniel Schmolze ▴ 50

4

Entering edit mode

alserg ▴ 260

@assaron

Last seen 8 days ago

St Louis, MO

I think it might be of interest, that I implemented a fast algorithm for pre-ranked GSEA as an R-package "fgsea" which is available at Bioconductor (http://bioconductor.org/packages/fgsea) in devel-version and on GitHub (https://github.com/ctlab/fgsea).

When I say "fast", I mean a ten thousand permutations in ten seconds in one thread and the results are practically the same as in Broad version. If you want to know why it's fast you can check out the manuscript (http://biorxiv.org/content/early/2016/06/20/060012).

Best,

Alexey

ADD COMMENT • link 7.7 years ago alserg ▴ 260

1

Entering edit mode

Jose M Garcia Manteiga ▴ 310

@jose-m-garcia-manteiga-6046

Last seen 7.1 years ago

Italy

Hi, I wanted the same and this is what gsea people replied to me some months ago: Hi, Thank you for your interest in GSEA. R GSEA does accept ranked list as an input. Please note that R GSEA has not been actively maintained since 2005. Regards, I have not gone through the R GSEA again to check how pre-ranked lists can be accepted, but the last sentence in their message made me hold a minute and think in other possibilities. For the moment I am still doing as you mentioned, using the rnk file with java, then getting back again from results of gsea to R. By the way, I would like to known which metric should be best to be used for that kind of analysis when using RNA-Seq data coming from DESeq2 analysis. log2FC? p-values? Are they considered to be weighted, as GSEA pre-ranked names them? Thanks Regards On 17 Jan 2014, at 00:18, Daniel Schmolze <bioconductor@schmolze.com<mailto:bioconductor@schmolze.com>> wrote: I want to do a GSEA entirely from within R, using genes ranked by my own metric. At the moment I'm saving my ranked genes to a .rnk file, then calling Broad's GSEA java program, then reading the resulting output back into R (all I care about are the p-values in gsea_report_for_na_pos_####.xls and gsea_report_for_na_neg_####.xls). Cumbersome to say the least. As far as I can tell, the Broad GSEA R script won't accept pre-ranked genes, but maybe I'm wrong? If not, I'm interested in other options. I'd like to specifically stick with the Broad GSEA algorithm if possible. _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD COMMENT • link 10.3 years ago Jose M Garcia Manteiga ▴ 310

0

Entering edit mode

hi Garcia, For DESeq2, we do recommend using shrunken log fold changes for gene set tests, that is, the log2FoldChange column of a results object when the default DESeq2 pipeline has been used. For a suggested workflow, see section 5.1 of the PDF for the presentation "DESeq2/DEXSeq / Alejandro Reyes, Wolfgang Huber", one of the Bioc2013 labs: http://bioconductor.org/help/course-materials/2013/BioC2013/ Mike On Fri, Jan 17, 2014 at 3:30 AM, Garcia Manteiga Jose Manuel < garciamanteiga.josemanuel@hsr.it> wrote: > Hi, > I wanted the same and this is what gsea people replied to me some months > ago: > > > Hi, > > Thank you for your interest in GSEA. > > R GSEA does accept ranked list as an input. > > Please note that R GSEA has not been actively maintained since 2005. > > Regards, > > > I have not gone through the R GSEA again to check how pre-ranked lists can > be accepted, but the last sentence in their message made me hold a minute > and think in other possibilities. > For the moment I am still doing as you mentioned, using the rnk file with > java, then getting back again from results of gsea to R. > > By the way, I would like to known which metric should be best to be used > for that kind of analysis when using RNA-Seq data coming from DESeq2 > analysis. log2FC? p-values? Are they considered to be weighted, as GSEA > pre-ranked names them? > Thanks > Regards > > > > On 17 Jan 2014, at 00:18, Daniel Schmolze <bioconductor@schmolze.com> <mailto:bioconductor@schmolze.com>> wrote: > > I want to do a GSEA entirely from within R, using genes ranked by my > own metric. At the moment I'm saving my ranked genes to a .rnk file, > then calling Broad's GSEA java program, then reading the resulting > output back into R (all I care about are the p-values in > gsea_report_for_na_pos_####.xls and gsea_report_for_na_neg_####.xls). > Cumbersome to say the least. > > As far as I can tell, the Broad GSEA R script won't accept pre- ranked > genes, but maybe I'm wrong? If not, I'm interested in other options. > I'd like to specifically stick with the Broad GSEA algorithm if > possible. > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org<mailto:bioconductor@r-project.org> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 10.3 years ago Michael Love 42k

0

Entering edit mode

Dear Mike, Thanks for the confirmation, I remember talking to someone during the Bioc2013 lab saying that same thing on shrunken log2FC but I do not know why I thought that the value to use with pre-ranked was not the log2FC in the results object but another one due to the shrinking procedure. On the other hand, on the workflow from the lab, I remember that in 5.1 log2FC was used to test whether particular gene sets had an average log2FC different from 0. That would use log2FC information but in a different way compared to how I understand GSEA pre-ranked analysis (the procedure published by the people in Broad). In the Broad pre-ranked strategy, the enrichment using log2FCs from the top (positive and negative) is giving you an enrichment score that should be more related, in my view, to biological relevance since it is linked to highest log2FCs observed, in absolute values, whereas the workflow in 5.1 would flag those gene sets that do show a different behaviour but maybe very low just because we are mathematically able to say so (because of the t-test used). By the way, could the thresholded test implementation in DESeq2 >1.3, or something similar, be used in a similar way to filter for those gene sets with a more relevant enrichment using the 5.1 workflow? Thanks again for the efforts Jose On 17 Jan 2014, at 16:10, Michael Love <michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>> wrote: hi Garcia, For DESeq2, we do recommend using shrunken log fold changes for gene set tests, that is, the log2FoldChange column of a results object when the default DESeq2 pipeline has been used. For a suggested workflow, see section 5.1 of the PDF for the presentation "DESeq2/DEXSeq / Alejandro Reyes, Wolfgang Huber", one of the Bioc2013 labs: http://bioconductor.org/help/course- materials/2013/BioC2013/ Mike On Fri, Jan 17, 2014 at 3:30 AM, Garcia Manteiga Jose Manuel <garciama nteiga.josemanuel@hsr.it<mailto:garciamanteiga.josemanuel@hsr.it="">> wrote: Hi, I wanted the same and this is what gsea people replied to me some months ago: Hi, Thank you for your interest in GSEA. R GSEA does accept ranked list as an input. Please note that R GSEA has not been actively maintained since 2005. Regards, I have not gone through the R GSEA again to check how pre-ranked lists can be accepted, but the last sentence in their message made me hold a minute and think in other possibilities. For the moment I am still doing as you mentioned, using the rnk file with java, then getting back again from results of gsea to R. By the way, I would like to known which metric should be best to be used for that kind of analysis when using RNA-Seq data coming from DESeq2 analysis. log2FC? p-values? Are they considered to be weighted, as GSEA pre-ranked names them? Thanks Regards On 17 Jan 2014, at 00:18, Daniel Schmolze <bioconductor@schmolze.com<m ailto:bioconductor@schmolze.com=""><mailto:bioconductor@schmolze.com<mail to:bioconductor@schmolze.com="">>> wrote: I want to do a GSEA entirely from within R, using genes ranked by my own metric. At the moment I'm saving my ranked genes to a .rnk file, then calling Broad's GSEA java program, then reading the resulting output back into R (all I care about are the p-values in gsea_report_for_na_pos_####.xls and gsea_report_for_na_neg_####.xls). Cumbersome to say the least. As far as I can tell, the Broad GSEA R script won't accept pre-ranked genes, but maybe I'm wrong? If not, I'm interested in other options. I'd like to specifically stick with the Broad GSEA algorithm if possible. _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org><mailto:b ioconductor@r-project.org<mailto:bioconductor@r-project.org="">> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 10.3 years ago Jose M Garcia Manteiga ▴ 310

0

Entering edit mode

hi Jose, On Fri, Jan 17, 2014 at 11:52 AM, Garcia Manteiga Jose Manuel < garciamanteiga.josemanuel@hsr.it> wrote: > Dear Mike, > Thanks for the confirmation, I remember talking to someone during the > Bioc2013 lab saying that same thing on shrunken log2FC but I do not know > why I thought that the value to use with pre-ranked was not the log2FC in > the results object but another one due to the shrinking procedure. > > On the other hand, on the workflow from the lab, I remember that in 5.1 > log2FC was used to test whether particular gene sets had an average log2FC > different from 0. That would use log2FC information but in a different way > compared to how I understand GSEA pre-ranked analysis (the procedure > published by the people in Broad). In the Broad pre-ranked strategy, the > enrichment using log2FCs from the top (positive and negative) is giving you > an enrichment score that should be more related, in my view, to biological > relevance since it is linked to highest log2FCs observed, in absolute > values, whereas the workflow in 5.1 would flag those gene sets that do show > a different behaviour but maybe very low just because we are mathematically > able to say so (because of the t-test used). > > By the way, could the thresholded test implementation in DESeq2 >1.3, or > something similar, be used in a similar way to filter for those gene sets > with a more relevant âenrichmentâ using the 5.1 workflow? > Thanks again for the efforts > âNo, because using tests of log2 fold change with thresholds > 0 produces different p-values. The 5.1 workflow doesn't involve p-values at all.â â Mikeâ > > Jose > > > > On 17 Jan 2014, at 16:10, Michael Love <michaelisaiahlove@gmail.com> > wrote: > > hi Garcia, > > For DESeq2, we do recommend using shrunken log fold changes for gene set > tests, that is, the log2FoldChange column of a results object when the > default DESeq2 pipeline has been used. > > For a suggested workflow, see section 5.1 of the PDF for the presentation "DESeq2/DEXSeq > / Alejandro Reyes, Wolfgang Huber", one of the Bioc2013 labs: > http://bioconductor.org/help/course-materials/2013/BioC2013/ > > Mike > > > > On Fri, Jan 17, 2014 at 3:30 AM, Garcia Manteiga Jose Manuel < > garciamanteiga.josemanuel@hsr.it> wrote: > >> Hi, >> I wanted the same and this is what gsea people replied to me some months >> ago: >> >> >> Hi, >> >> Thank you for your interest in GSEA. >> >> R GSEA does accept ranked list as an input. >> >> Please note that R GSEA has not been actively maintained since 2005. >> >> Regards, >> >> >> I have not gone through the R GSEA again to check how pre-ranked lists >> can be accepted, but the last sentence in their message made me hold a >> minute and think in other possibilities. >> For the moment I am still doing as you mentioned, using the rnk file with >> java, then getting back again from results of gsea to R. >> >> By the way, I would like to known which metric should be best to be used >> for that kind of analysis when using RNA-Seq data coming from DESeq2 >> analysis. log2FC? p-values? Are they considered to be weighted, as GSEA >> pre-ranked names them? >> Thanks >> Regards >> >> >> >> On 17 Jan 2014, at 00:18, Daniel Schmolze <bioconductor@schmolze.com>> <mailto:bioconductor@schmolze.com>> wrote: >> >> I want to do a GSEA entirely from within R, using genes ranked by my >> own metric. At the moment I'm saving my ranked genes to a .rnk file, >> then calling Broad's GSEA java program, then reading the resulting >> output back into R (all I care about are the p-values in >> gsea_report_for_na_pos_####.xls and gsea_report_for_na_neg_####.xls). >> Cumbersome to say the least. >> >> As far as I can tell, the Broad GSEA R script won't accept pre- ranked >> genes, but maybe I'm wrong? If not, I'm interested in other options. >> I'd like to specifically stick with the Broad GSEA algorithm if >> possible. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org<mailto:bioconductor@r-project.org> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > [[alternative HTML version deleted]]

ADD REPLY • link 10.3 years ago Michael Love 42k

1

Entering edit mode

Pekka Kohonen ▴ 190

@pekka-kohonen-5862

Last seen 6.3 years ago

Sweden

Dear Daniel, The "piano" package does GSEA with pre-ranked genes. You can input, for instance, t-test statistics from limma/ebayes, fold-changes etc. But the method vveeeerrrryyyy sloooow compared to the Broad java implementation. Don't know why (maybe it would be a good idea to transition R to run on the jvm a la renjin, thought not sure if this project is still active). But you can run it.You may want to take a look at the options (if they are the same) Best, Pekka 2014/1/17 Daniel Schmolze <bioconductor at="" schmolze.com="">: > I want to do a GSEA entirely from within R, using genes ranked by my > own metric. At the moment I'm saving my ranked genes to a .rnk file, > then calling Broad's GSEA java program, then reading the resulting > output back into R (all I care about are the p-values in > gsea_report_for_na_pos_####.xls and gsea_report_for_na_neg_####.xls). > Cumbersome to say the least. > > As far as I can tell, the Broad GSEA R script won't accept pre- ranked > genes, but maybe I'm wrong? If not, I'm interested in other options. > I'd like to specifically stick with the Broad GSEA algorithm if > possible. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 10.3 years ago Pekka Kohonen ▴ 190

0

Entering edit mode

Dear Pekka and Daniel, Yes, the GSEA implementation in piano is one of the slowest (of the implemented methods), however since version 1.2.0 the argument ncpus was added and let's you run the algorithm on multiple cpus (using the R package snowfall) thus decreasing the runtime (especially if you have access to a computer cluster). In my trials running on 16 cpus I could save around 90% of the time compared to the previous version. All the best Leif Väremo On Fri, Jan 17, 2014 at 1:16 PM, Pekka Kohonen <pkpekka@gmail.com> wrote: > Dear Daniel, > > The "piano" package does GSEA with pre-ranked genes. You can input, > for instance, t-test statistics from limma/ebayes, fold-changes etc. > But the method vveeeerrrryyyy sloooow compared to the Broad java > implementation. Don't know why (maybe it would be a good idea to > transition R to run on the jvm a la renjin, thought not sure if this > project is still active). But you can run it.You may want to take a > look at the options (if they are the same) > > Best, Pekka > > 2014/1/17 Daniel Schmolze <bioconductor@schmolze.com>: > > I want to do a GSEA entirely from within R, using genes ranked by my > > own metric. At the moment I'm saving my ranked genes to a .rnk file, > > then calling Broad's GSEA java program, then reading the resulting > > output back into R (all I care about are the p-values in > > gsea_report_for_na_pos_####.xls and gsea_report_for_na_neg_####.xls). > > Cumbersome to say the least. > > > > As far as I can tell, the Broad GSEA R script won't accept pre- ranked > > genes, but maybe I'm wrong? If not, I'm interested in other options. > > I'd like to specifically stick with the Broad GSEA algorithm if > > possible. > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 10.3 years ago Leif Väremo ▴ 70

0

Entering edit mode

Der Leif, Excellent that the multi-cpu option is supported. "piano" is a very nice package, it is easy to use and runs the commonly used methods. Best, Pekka 2014/1/17 Leif V?remo <varemo at="" chalmers.se="">: > Dear Pekka and Daniel, > > Yes, the GSEA implementation in piano is one of the slowest (of the > implemented methods), however since version 1.2.0 the argument ncpus was > added and let's you run the algorithm on multiple cpus (using the R package > snowfall) thus decreasing the runtime (especially if you have access to a > computer cluster). In my trials running on 16 cpus I could save around 90% > of the time compared to the previous version. > > All the best > Leif V?remo > > > On Fri, Jan 17, 2014 at 1:16 PM, Pekka Kohonen <pkpekka at="" gmail.com=""> wrote: >> >> Dear Daniel, >> >> The "piano" package does GSEA with pre-ranked genes. You can input, >> for instance, t-test statistics from limma/ebayes, fold-changes etc. >> But the method vveeeerrrryyyy sloooow compared to the Broad java >> implementation. Don't know why (maybe it would be a good idea to >> transition R to run on the jvm a la renjin, thought not sure if this >> project is still active). But you can run it.You may want to take a >> look at the options (if they are the same) >> >> Best, Pekka >> >> 2014/1/17 Daniel Schmolze <bioconductor at="" schmolze.com="">: >> > I want to do a GSEA entirely from within R, using genes ranked by my >> > own metric. At the moment I'm saving my ranked genes to a .rnk file, >> > then calling Broad's GSEA java program, then reading the resulting >> > output back into R (all I care about are the p-values in >> > gsea_report_for_na_pos_####.xls and gsea_report_for_na_neg_####.xls). >> > Cumbersome to say the least. >> > >> > As far as I can tell, the Broad GSEA R script won't accept pre- ranked >> > genes, but maybe I'm wrong? If not, I'm interested in other options. >> > I'd like to specifically stick with the Broad GSEA algorithm if >> > possible. >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor at r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> > http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD REPLY • link 10.3 years ago Pekka Kohonen ▴ 190

0

Entering edit mode

Hello Pekka and Daniel,

Sorry for picking up the old thread, I just wanted to let you know that if you are still interested in this, I did some rewrite of piano sampling function, now it can do 10000 permutations for ~900 reactome pathways in about 10 min in a single thread. It did not went into the package on bioconductor, but you can build the package yourself from here: https://github.com/assaron/piano

Best,

Alexey

ADD REPLY • link 8.2 years ago alserg ▴ 260

0

Entering edit mode

Hi Alexey,

Does the speed improvement apply to geneSetStat == "gsea"?

ADD REPLY • link 8.1 years ago jeales • 0

0

Entering edit mode

Yes, I mainly worked on improving gsea statistic performance. There is also a branch fastGSEA in the github repo that has even faster version, however it's slightly different from the original formulation.

ADD REPLY • link 8.0 years ago alserg ▴ 260

0

Entering edit mode

A faster GSEA is great.

Did you implement it in the stable release?

I'm using version 1.10.2 and it is still slow.

ADD REPLY • link 7.8 years ago gil.hornung • 0

0

Entering edit mode

Please, check out the package fgsea (http://bioconductor.org/packages/devel/bioc/html/fgsea.html) it now has a very fast algorithm for pre-ranked GSEA. If you are not using devel version of Bioconductor you can install it from github using devtools::install_github("ctlab/fgsea")

ADD REPLY • link 7.7 years ago alserg ▴ 260

Login before adding your answer.