Entering edit mode
Dear Garcia and Daniel,
If you must do a GSEA pre-ranked analysis, then I think the best
statistic
to use is a shrunken fold change, as it is the best predictor of the
size
of the biological effect for each gene. (Statistical significance
meauses
something slightly different.) The predFC() and glmFit() functions of
the
edgeR package have offered "predictive fold changes" for quite a few
years, and they would still be my choice for this purpose. voom and
DESeq2 are more recent alternatives, but I still recommend predFC()
with
prior.count=3 or 5 for this purpose. And predFC() will work even
without
biological replicates.
However a GSEA pre-ranked analysis does not (and cannot) give correct
p-values, regardless of the metric used, because it does not take
account
of inter-gene correlations. This has been shown by a number of
published
papers. For example, Wu et al (2012) showed that pre-ranked gene set
analyses give spectactularly wrong p-values even when the inter-gene
correlation is as low as 0.05. Have a look at Table 1 of:
http://nar.oxfordjournals.org/content/40/17/e133
The GSEA pre-ranked analysis has similar behaviour to "geneSetTest" in
this table. See the remark about GSEA in the last sentence of page 9.
If your data has biological replicates, then more reliable methods are
readily available. I don't know whether you want to test one gene set
or
the whole MSigDb collection. Either way, the mroast() or camera()
methods
of the limma package work seamlessly with either voom or edgeR for
RNA-seq
data, and either will take account inter-gene correlations. See for
example the roast() and camera() tests presented in the first case
study
of:
http://www.statsci.org/smyth/pubs/VoomPreprint.pdf
(This paper will appear in Genome Biology next month.)
Best wishes
Gordon
> Date: Fri, 17 Jan 2014 09:30:47 +0100
> From: Garcia Manteiga Jose Manuel <garciamanteiga.josemanuel at="" hsr.it="">
> To: Daniel Schmolze <bioconductor at="" schmolze.com="">
> Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org="">
> Subject: Re: [BioC] pre-ranked GSEA within R? + Best DESeq2/limma-
voom
> metric?
>
> Hi,
> I wanted the same and this is what gsea people replied to me some
months ago:
>
>
> Hi,
>
> Thank you for your interest in GSEA.
>
> R GSEA does accept ranked list as an input.
>
> Please note that R GSEA has not been actively maintained since 2005.
>
> Regards,
>
>
> I have not gone through the R GSEA again to check how pre-ranked
lists
> can be accepted, but the last sentence in their message made me hold
a
> minute and think in other possibilities.
> For the moment I am still doing as you mentioned, using the rnk file
> with java, then getting back again from results of gsea to R.
>
> By the way, I would like to known which metric should be best to be
used
> for that kind of analysis when using RNA-Seq data coming from DESeq2
> analysis. log2FC? p-values? Are they considered to be weighted, as
GSEA
> pre-ranked names them?
> Thanks
> Regards
>
>
>
> On 17 Jan 2014, at 00:18, Daniel Schmolze <bioconductor at="" schmolze.com<mailto:bioconductor="" at="" schmolze.com="">> wrote:
>
> I want to do a GSEA entirely from within R, using genes ranked by my
> own metric. At the moment I'm saving my ranked genes to a .rnk file,
> then calling Broad's GSEA java program, then reading the resulting
> output back into R (all I care about are the p-values in
> gsea_report_for_na_pos_####.xls and
gsea_report_for_na_neg_####.xls).
> Cumbersome to say the least.
>
> As far as I can tell, the Broad GSEA R script won't accept pre-
ranked
> genes, but maybe I'm wrong? If not, I'm interested in other options.
> I'd like to specifically stick with the Broad GSEA algorithm if
> possible.
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}