Hello all,
I have a list of differentially expressed genes from an RNA-seq
analysis.
Also, I have a two-column annotation file for the organism, with the
columns being gene and GO term.
Please guide me towards a Bioconductor package or any other tool that
could take my list and annotation file as input and do gene set
enrichment analysis.
Thanks,
Al
If I may chime in:
GSEA does work with pre-sorted gene lists; human gene names were
required last time I looked.
I found ranking by fold-change alone unsatisfactory for GSEA,
as this ignores the statistical outcome of the differential expression
test.
Ranking by (-log10(padj))*(log2(ratio)) works a bit better but still
lets fold-change outliers (high fold-change but not significant) pass
through.
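As a small illustration of that outlier problem (the gene names and values here are made up, and the column names log2fc and padj are assumptions): a gene with a large but non-significant fold change can still outscore a clearly significant gene with a modest fold change.

```r
# Hypothetical values: A is a non-significant fold-change outlier,
# B is clearly significant with a modest fold change.
res <- data.frame(
  gene   = c("A", "B"),
  log2fc = c(8, 1),
  padj   = c(0.3, 0.001),
  stringsAsFactors = FALSE
)

# Combined score: (-log10(padj)) * (log2(ratio))
res$score <- -log10(res$padj) * res$log2fc

# Sort by descending score; A ends up ahead of B
print(res[order(-res$score), ])
```

Here A scores roughly 4.2 and B scores 3, so the non-significant gene is ranked first.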
I ended up constructing my ranked gene list in four parts:
1. Statistically significant (padj derived from either DESeq or
edgeR), up-regulated ranked descending by fold change
2. Not significant, expression increased or no change, ranked
ascending by p value
3. Not significant, expression decreased or no change, ranked
descending by p value
4. Statistically significant, down-regulated, ranked descending by
fold change
Not elegant, but somewhat workable; GSEA calls have to be scrutinized
at the 1-2 and 3-4 boundaries.
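The four-part ranking above could be sketched in base R roughly as follows. This is only a sketch under assumptions: the column names (gene, log2fc, pval, padj), the 0.05 cutoff and the toy values are all hypothetical, genes with zero fold change are placed in part 2, and log2fc is assumed to have no missing values.

```r
# Build a ranked gene list in four blocks, following the scheme above.
# Column names and the alpha cutoff are assumptions for illustration.
rank_four_parts <- function(res, alpha = 0.05) {
  sig <- !is.na(res$padj) & res$padj < alpha
  up  <- res$log2fc >= 0

  part1 <- res[sig & up, ]
  part1 <- part1[order(-part1$log2fc), ]  # significant up: largest fold change first

  part2 <- res[!sig & up, ]
  part2 <- part2[order(part2$pval), ]     # not significant, up or unchanged: ascending p

  part3 <- res[!sig & !up, ]
  part3 <- part3[order(-part3$pval), ]    # not significant, down: descending p

  part4 <- res[sig & !up, ]
  part4 <- part4[order(-part4$log2fc), ]  # significant down: most negative fold change last

  rbind(part1, part2, part3, part4)$gene
}

# Toy differential expression table (made-up values)
res <- data.frame(
  gene   = c("g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"),
  log2fc = c( 3.0,  1.5,  0.8,  0.1, -0.2, -0.9, -1.2, -2.5),
  pval   = c(1e-6, 1e-4, 0.20, 0.90, 0.80, 0.10, 1e-4, 1e-7),
  padj   = c(1e-5, 1e-3, 0.40, 0.95, 0.90, 0.30, 1e-3, 1e-6),
  stringsAsFactors = FALSE
)

ranked <- rank_four_parts(res)
print(ranked)
```

The resulting vector runs from the strongest significant up-regulated gene to the strongest significant down-regulated gene, with the non-significant genes in between.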
Cheers, michael
J. Michael Salbaum, Ph.D.
Associate Professor
Pennington Biomedical Research Center
Louisiana State University System
6400 Perkins Road
Baton Rouge, LA 70808
(225) 763-2782
-----Original Message-----
From: bioconductor-bounces@r-project.org on behalf of Steve Lianoglou
Sent: Sun 12/2/2012 4:41 AM
To: Gordon K Smyth
Cc: Bioconductor mailing list
Subject: Re: [BioC] gene set enrichment
Hi Gordon,
When an expert comments on a topic I'm interested in, it's hard for me
not to press for more insight so I hope you don't mind, but also ...
you know .. take your time :-)
On Sat, Dec 1, 2012 at 8:39 PM, Gordon K Smyth <smyth@wehi.edu.au> wrote:
[snip]
> The term "gene set enrichment analysis" was coined by the Broad Institute:
>
> http://www.broadinstitute.org/gsea/
>
> but you certainly can't simply give a list of genes to GSEA. It requires
> complete data and is designed for microarrays rather than RNA-Seq anyway.
I'm curious if you say so because GSEA doesn't account for something
like length bias? The GSEA folks seem to suggest that one could do
this like any other "pre-processed" GSEA analysis by simply providing
a ranked list of genes (presumably by fold-change):
http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/FAQ#Can_I_use_GSEA_to_analyze_SNP.2C_SAGE.2C_ChIP-Seq_or_RNA-Seq_data.3F
Would you mind (briefly) elaborating a bit on why you disagree?
Thanks,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
Hi,
If you have the Entrez IDs for your differentially expressed genes, you
can try GeneAnswers and clusterProfiler. Once you have an idea of how
to use these packages for GSEA, you should be able to use your GO terms
along with the differentially expressed gene list for further analysis.
Regards
Reema Singh
Perhaps Dr. Smyth is referring to the type I error inflation that can
be introduced by correlation within gene sets, and which seems to
remain uncorrected in typical gene set analyses? Di Wu wrote a nice
paper on this, centered on the 'camera' function, which indicated that
severe type I error inflation could be reined in by empirically
correcting for the correlation within sets.
http://nar.oxfordjournals.org/content/40/17/e133
I am not an expert, but I found the paper interesting, more so in light
of papers from Rick Young's lab at the Whitehead Institute which, in so
many words, suggest that widespread transcription amplification by
(e.g.) c-Myc may render many assumptions underlying quantile
normalization invalid. It would seem that many assumptions from
microarray analysis are due for re-examination, if my observations are
not far off base. But I am not an expert and would love to hear from
those who are.
--
*A model is a lie that helps you see the truth.*
Howard Skipper <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
The function phyper() can help you with this.
We also have a package called CORNA
(http://corna.sourceforge.net/tutorial.html) that might help, but this
needs to be updated for the latest version of R.
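To make the phyper() suggestion concrete, here is a minimal base-R sketch of a one-sided hypergeometric over-representation test per GO term. The data frame annot stands in for the two-column gene/goterm file and de_genes for the DE list; the gene IDs, GO labels and the function name go_enrichment are made up for illustration. The background is assumed to be all annotated genes, and the test does not adjust for gene length.

```r
# Toy stand-ins for the two-column annotation file and the DE gene list.
annot <- data.frame(
  gene   = c("g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g1", "g2"),
  goterm = c("GO:A", "GO:A", "GO:A", "GO:B", "GO:B", "GO:B", "GO:B", "GO:B",
             "GO:B", "GO:C"),
  stringsAsFactors = FALSE
)
de_genes <- c("g1", "g2", "g3")

go_enrichment <- function(annot, de_genes) {
  annot    <- unique(annot)
  universe <- unique(annot$gene)          # background: all annotated genes
  de       <- intersect(de_genes, universe)
  N <- length(universe)                   # genes in the background
  n <- length(de)                         # DE genes in the background

  sets <- split(annot$gene, annot$goterm)
  res <- data.frame(
    goterm     = names(sets),
    in_term    = vapply(sets, length, integer(1)),
    de_in_term = vapply(sets, function(g) sum(g %in% de), integer(1)),
    stringsAsFactors = FALSE
  )

  # P(X >= de_in_term) when drawing n genes from N, in_term of which carry the term
  res$pvalue <- phyper(res$de_in_term - 1, res$in_term, N - res$in_term, n,
                       lower.tail = FALSE)
  res$padj <- p.adjust(res$pvalue, method = "BH")
  res[order(res$pvalue), ]
}

result <- go_enrichment(annot, de_genes)
print(result)
```

With a real annotation file, annot could be read with read.table() and the same function applied unchanged, provided the columns are named gene and goterm.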
Mick
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
Thanks Gordon.
I was trying to install the latest version of R and goseq, but goseq
won't load anymore.
Do you have any insight into why this would happen? Maybe I'm doing
something wrong.
> biocLite("goseq")
BioC_mirror: http://bioconductor.org
Using Bioconductor version 2.11 (BiocInstaller 1.8.3), R version 2.15.
Installing package(s) 'goseq'
trying URL 'http://bioconductor.org/packages/2.11/bioc/bin/windows/contrib/2.15/goseq_1.10.0.zip'
Content type 'application/zip' length 751702 bytes (734 Kb)
opened URL
downloaded 734 Kb
package 'goseq' successfully unpacked and MD5 sums checked
> library(goseq)
Loading required package: BiasedUrn
Loading required package: geneLenDataBase
Error in loadNamespace(i[[1L]], c(lib.loc, .libPaths())) :
  there is no package called 'Biobase'
Error: package 'geneLenDataBase' could not be loaded
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BiasedUrn_1.04      BiocInstaller_1.8.3

loaded via a namespace (and not attached):
 [1] BiocGenerics_0.4.0   Biostrings_2.26.2    bitops_1.0-5         BSgenome_1.26.1      DBI_0.2-5            GenomicRanges_1.10.5 IRanges_1.16.4
 [8] parallel_2.15.2      RCurl_1.95-3         Rsamtools_1.10.2     RSQLite_0.11.2       rtracklayer_1.18.1   stats4_2.15.2        tools_2.15.2
[15] XML_3.95-0.1         zlibbioc_1.4.0
Thanks,
Al
On Sat, Dec 1, 2012 at 7:39 PM, Gordon K Smyth <smyth@wehi.edu.au> wrote:
> Dear Al,
>
> The obvious answer is the goseq package. However you have already
> received assistance with goseq:
>
> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043779.html
>
> So if you are not trying to do a Gene Ontology analysis like goseq does,
> what is it that you are trying to do?
>
> The term "gene set enrichment analysis" was coined by the Broad Institute:
>
> http://www.broadinstitute.org/gsea/
>
> but you certainly can't simply give a list of genes to GSEA. It requires
> complete data and is designed for microarrays rather than RNA-Seq anyway.
>
> Best wishes
> Gordon
>
> ----------------- original message -----------------
> [BioC] gene set enrichment
> Alpesh Querer alpeshq at gmail.com
> Sun Dec 2 01:27:41 CET 2012
On 12/03/2012 09:05 AM, Alpesh Querer wrote:
>> library(goseq)
> Loading required package: BiasedUrn
> Loading required package: geneLenDataBase
> Error in loadNamespace(i[[1L]], c(lib.loc, .libPaths())) :
>   there is no package called 'Biobase'
> Error: package 'geneLenDataBase' could not be loaded
geneLenDataBase (?) seems to be missing a dependency. Try
biocLite("Biobase")
first.
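As a sketch of how one might check for this kind of problem before loading goseq: the helper below (a hypothetical function, not part of BiocInstaller) reports which of a set of packages are not installed, so they can be passed to biocLite(). The package list is taken from the error messages above; the biocLite() call is commented out because it needs network access and assumes the Bioconductor 2.11 installer used in this thread.

```r
# Hypothetical helper: which of these packages are missing from the
# current library paths?
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

needed     <- c("Biobase", "geneLenDataBase", "goseq")
to_install <- missing_packages(needed)
print(to_install)

# Assumption: biocLite() is available, as in the session shown above.
# if (length(to_install) > 0) biocLite(to_install)
# library(goseq)
```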
Martin
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
Hi Steve,
Thanks for correcting me.
I said that GSEA requires full data because this is true of the published GSEA algorithm (Subramanian et al 2005). The published GSEA approach permutes arrays and therefore requires all the data. I forgot that the GSEA software provides an alternative short-cut approach (permuting genes) that can be used when there are no replicates or one just has a ranked gene list.
The GSEA ranked gene list approach is similar in principle to the geneSetTest() function in the limma package. This approach has the disadvantage that it does not correct for inter-gene correlations, as we pointed out in our recent camera paper (thanks to Tim Triche for giving the reference).
However, the same criticism (that inter-gene correlation is ignored) can be made of all GO overlap analysis software as well, including goseq. So the only clear advantage of goseq over GSEA here is the adjustment for gene length. As compensation, GSEA-ranked-list uses the rankings of the DE genes, which goseq ignores.
As you probably know, the whole area of gene set testing is a hot area of research, and the inter-relationships between the many different approaches are still imperfectly understood. Methods like geneSetTest and GSEA-ranked-list are anti-conservative. Methods like roast, camera or classic GSEA are conservative and safe. GO overlap analyses like goseq, GOStat, DAVID etc. are anti-conservative in principle but, in practice, the conservatism of multiple testing adjustment tends to make them conservative. Different approaches test different hypotheses and emphasise different aspects of the data.
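To illustrate what a ranked-list, gene-permutation style test does, here is a minimal base-R sketch: a Wilcoxon rank-sum test comparing the statistics of genes inside a set with those outside it. This is only an analogue of the idea, not limma's geneSetTest() or the GSEA software; the function name rank_set_test, the simulated statistics and the set memberships are all made up. Because it treats genes as independent, it shares the anti-conservative tendency described above when genes in a set are correlated.

```r
set.seed(1)

# Simulated gene-level statistics (e.g. t-statistics); illustrative only.
stat <- rnorm(1000)
names(stat) <- paste0("gene", seq_along(stat))

# A hypothetical gene set whose members are shifted upwards by 1
gene_set <- names(stat)[1:30]
stat[gene_set] <- stat[gene_set] + 1

# Competitive rank-based test: are statistics in the set larger than the rest?
rank_set_test <- function(stat, gene_set, alternative = "greater") {
  in_set <- names(stat) %in% gene_set
  wilcox.test(stat[in_set], stat[!in_set], alternative = alternative)$p.value
}

p_shifted <- rank_set_test(stat, gene_set)
p_random  <- rank_set_test(stat, names(stat)[501:530])  # unshifted comparison set
print(c(shifted = p_shifted, random = p_random))
```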
Best wishes
Gordon