gene set enrichment

Alpesh Querer ▴ 220

@alpesh-querer-4895

United States

Hello all,

I have a list of differentially expressed genes from an RNA-seq analysis. I also have a two-column annotation file for the organism, with the columns being gene and goterm. Please guide me towards a Bioconductor package or any other tool that could take my list and annotation file as input and do gene set enrichment analysis.

Thanks,
Al

Annotation Organism • 3.8k views

ADD COMMENT • link updated 3.7 years ago by Gordon Smyth 53k • written 13.0 years ago by Alpesh Querer ▴ 220

Michael Salbaum ▴ 80

@michael-salbaum-5309

If I may chime in: GSEA does work with pre-sorted gene lists; human gene names were required last time I looked. I found ranking by fold change alone not to be satisfactory for GSEA, as this ignores the statistical outcome of a differential expression test. Ranking by (-log10(padj))*(log2(ratio)) works a bit better but still lets fold-change outliers (high fold change but not significant) pass through. I ended up constructing my ranked gene list in four parts:

1. Statistically significant (padj derived from either DESeq or edgeR), up-regulated, ranked descending by fold change
2. Not significant, expression increased or no change, ranked ascending by p value
3. Not significant, expression decreased or no change, ranked descending by p value
4. Statistically significant, down-regulated, ranked descending by fold change

Not elegant, but somewhat workable; GSEA calls have to be scrutinized at the 1-2 and 3-4 boundaries.

Cheers,
michael
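A minimal base-R sketch of the four-part ranking described above. The column names (`gene`, `log2fc`, `pvalue`, `padj`), the function name `four_part_rank`, the 0.05 cutoff and the toy data are all assumptions made for illustration; genes with exactly zero fold change are placed in the "increased or no change" group.

```r
# Hypothetical helper: order a DE results table into the four blocks.
# Assumes columns gene, log2fc, pvalue, padj (names chosen for this sketch).
four_part_rank <- function(res, alpha = 0.05) {
  # Drop genes without a usable fold change or p value
  res <- res[!is.na(res$log2fc) & !is.na(res$pvalue), ]

  sig <- !is.na(res$padj) & res$padj < alpha
  up  <- res$log2fc >= 0

  part1 <- res[sig & up, ]    # significant, up-regulated
  part2 <- res[!sig & up, ]   # not significant, increased or no change
  part3 <- res[!sig & !up, ]  # not significant, decreased
  part4 <- res[sig & !up, ]   # significant, down-regulated

  rbind(
    part1[order(-part1$log2fc), ],  # descending fold change
    part2[order(part2$pvalue), ],   # ascending p value
    part3[order(-part3$pvalue), ],  # descending p value
    part4[order(-part4$log2fc), ]   # descending fold change (most negative last)
  )
}

# Toy data, invented for the example
res <- data.frame(
  gene   = c("g1", "g2", "g3", "g4", "g5", "g6"),
  log2fc = c(2.0, 0.5, -0.3, -1.5, 3.0, -2.5),
  pvalue = c(0.001, 0.40, 0.60, 0.002, 0.0005, 0.0001),
  padj   = c(0.01, 0.50, 0.70, 0.02, 0.005, 0.001),
  stringsAsFactors = FALSE
)

ranked <- four_part_rank(res)
print(ranked$gene)
```

The resulting gene order runs from strongly up-regulated at the top to strongly down-regulated at the bottom; calls driven by genes near the block boundaries still deserve a closer look.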

ADD COMMENT • link 13.0 years ago Michael Salbaum ▴ 80

Reema Singh ▴ 570

@reema-singh-4373

Hi,

If you have the Entrez IDs for your differentially expressed genes, you can try GeneAnswers and clusterProfiler. Once you understand how to use these packages for GSEA, you will be able to use your GO terms along with the differentially expressed gene list for further analysis.

Regards,
Reema Singh

ADD COMMENT • link 13.0 years ago Reema Singh ▴ 570

Gordon Smyth 53k

@gordon-smyth

WEHI, Melbourne, Australia

Dear Al,

The obvious answer is the goseq package. However you have already received assistance with goseq:

https://stat.ethz.ch/pipermail/bioconductor/2012-February/043779.html

So if you are not trying to do a Gene Ontology analysis like goseq does, what is it that you are trying to do?

The term "gene set enrichment analysis" was coined by the Broad Institute:

http://www.broadinstitute.org/gsea/

but you certainly can't simply give a list of genes to GSEA. It requires complete data and is designed for microarrays rather than RNA-Seq anyway.

Best wishes
Gordon

ADD COMMENT • link 13.0 years ago • updated 7.1 years ago Gordon Smyth 53k

Hi Gordon,

When an expert comments on a topic I'm interested in, it's hard for me not to press for more insight, so I hope you don't mind, but also ... you know ... take your time :-)

On Sat, Dec 1, 2012 at 8:39 PM, Gordon K Smyth <smyth@wehi.edu.au> wrote:
[snip]
> The term "gene set enrichment analysis" was coined by the Broad Institute:
>
> http://www.broadinstitute.org/gsea/
>
> but you certainly can't simply give a list of genes to GSEA. It requires
> complete data and is designed for microarrays rather than RNA-Seq anyway.

I'm curious if you say so because GSEA doesn't account for something like length bias? The GSEA folks seem to suggest that one could do this like any other "pre-processed" GSEA analysis by simply providing a ranked list of genes (presumably by fold-change):

http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/FAQ#Can_I_use_GSEA_to_analyze_SNP.2C_SAGE.2C_ChIP-Seq_or_RNA-Seq_data.3F

Would you mind (briefly) elaborating a bit on why you disagree?

Thanks,
-steve

ADD REPLY • link 13.0 years ago Steve Lianoglou ★ 13k

Perhaps Dr. Smyth is referring to the Type I error inflation that can be introduced by correlation within gene sets, which seems to remain uncorrected in typical gene set analyses? Di Wu wrote a nice paper on this, centered on the 'camera' function, which indicated that severe Type I error inflation could be reined in by empirically correcting for the correlation within sets.

http://nar.oxfordjournals.org/content/40/17/e133

I am not an expert but I found the paper interesting, more so in light of papers from Rick Young's lab at the Whitehead Institute which, in so many words, suggest that widespread transcription amplification by (e.g.) c-Myc may render many assumptions underlying quantile normalization invalid. It would seem that many assumptions from microarray analysis are due for re-examination if my observations are not far off base. But I am not an expert and would love to hear from those who are.

ADD REPLY • link 13.0 years ago Tim Triche ★ 4.2k

Hi Steve,

Thanks for correcting me.

I said that GSEA requires full data because this is true of the published GSEA algorithm (Subramanian et al 2005). The published GSEA approach permutes arrays and therefore requires all the data. I forgot that the GSEA software provides an alternative short-cut approach (permuting genes) that can be used when there are no replicates or one just has a ranked gene list.

The GSEA ranked gene list approach is similar in principle to the geneSetTest() function in the limma package. This approach has the disadvantage that it does not correct for inter-gene correlations, as we pointed out in our recent camera paper (thanks to Tim Triche for giving the reference).

However, the same criticism (that inter-gene correlation is ignored) can be made of all GO overlap analysis software as well, including goseq. So the only clear advantage of goseq over GSEA here is the adjustment for gene length. As compensation, GSEA-ranked-list uses the rankings of the DE genes that goseq ignores.

As you probably know, the whole area of gene set testing is a hot area of research, and the inter-relationships between the many different approaches are still imperfectly understood. Methods like geneSetTest and GSEA-ranked-list are anti-conservative. Methods like roast, camera or classic GSEA are conservative and safe. GO overlap analyses like goseq, GOStat, DAVID etc. are anti-conservative in principle but, in practice, multiple testing conservatism tends to make them conservative. Different approaches test different hypotheses and emphasise different aspects of the data.

Best wishes
Gordon
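
To make the "ranked list" idea concrete, here is a minimal sketch of a competitive, gene-permutation style test: a Wilcoxon rank-sum test comparing the statistics of genes in a set with those of all other genes. It is written in base R and is not the code of geneSetTest() or of the GSEA software; the function name `rank_set_test`, the gene names and the toy values are assumptions.

```r
# Toy per-gene statistics (for example t-statistics); values are invented.
stat <- c(5.0, 4.5, 4.0, 3.8, seq(-1, 1, length.out = 16))
names(stat) <- paste0("g", seq_along(stat))

# Hypothetical gene set: the four genes with the largest statistics.
gene_set <- c("g1", "g2", "g3", "g4")

rank_set_test <- function(stat, gene_set, alternative = "greater") {
  in_set <- names(stat) %in% gene_set
  # Rank-sum test of set genes versus all remaining genes.
  # Genes are treated as independent here, so correlation between genes
  # in the set is ignored; that is the source of the anti-conservative
  # behaviour discussed above.
  wilcox.test(stat[in_set], stat[!in_set], alternative = alternative)$p.value
}

p <- rank_set_test(stat, gene_set)
print(p)
```

Because only the per-gene statistics are needed, this kind of test works from a ranked list alone, whereas sample-permutation methods need the full expression data.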

ADD REPLY • link 13.0 years ago • updated 7.1 years ago Gordon Smyth 53k

WATSON Mick ▴ 50

@watson-mick-5575

United Kingdom

The function phyper() can help you with this. We also have a package called CORNA (http://corna.sourceforge.net/tutorial.html) that might help, but this needs to be updated for the latest version of R.

Mick
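
A minimal sketch of how phyper() could be applied to a DE gene list and a two-column gene/goterm table. It assumes the gene universe is every gene in the annotation table (in practice it should be the set of genes that were actually tested). The function name `go_overrep` and the toy data are hypothetical, and unlike goseq this does not adjust for gene length bias.

```r
# Hypothetical helper: one-sided hypergeometric test per GO term.
# annot must have columns named gene and goterm.
go_overrep <- function(de_genes, annot) {
  annot    <- unique(annot[, c("gene", "goterm")])
  universe <- unique(annot$gene)
  de       <- intersect(de_genes, universe)

  N <- length(universe)  # genes in the universe
  n <- length(de)        # DE genes in the universe

  sets <- split(annot$gene, annot$goterm)
  out <- data.frame(
    goterm    = names(sets),
    set_size  = vapply(sets, length, integer(1)),
    de_in_set = vapply(sets, function(g) sum(g %in% de), integer(1)),
    stringsAsFactors = FALSE
  )

  # P(X >= de_in_set) when drawing n genes from N, set_size of which are in the term
  out$pvalue <- phyper(out$de_in_set - 1, out$set_size, N - out$set_size, n,
                       lower.tail = FALSE)
  out$padj <- p.adjust(out$pvalue, method = "BH")
  out[order(out$pvalue), ]
}

# Toy annotation and DE list, invented for the example
annot <- data.frame(
  gene   = paste0("g", 1:10),
  goterm = c(rep("GO:A", 3), rep("GO:B", 7)),
  stringsAsFactors = FALSE
)
de_genes <- c("g1", "g2", "g3")

res <- go_overrep(de_genes, annot)
print(res)
```

With a real annotation file, `annot` could be read with read.delim() and its two columns renamed to `gene` and `goterm` before calling the function.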

ADD COMMENT • link 13.0 years ago WATSON Mick ▴ 50

Alpesh Querer ▴ 220

@alpesh-querer-4895

United States

Thanks Gordon.

I was trying to install the latest version of R and goseq, but it wouldn't load anymore. Do you have any insight into why this would happen? Maybe I'm doing something wrong.

> biocLite("goseq")
BioC_mirror: http://bioconductor.org
Using Bioconductor version 2.11 (BiocInstaller 1.8.3), R version 2.15.
Installing package(s) 'goseq'
trying URL 'http://bioconductor.org/packages/2.11/bioc/bin/windows/contrib/2.15/goseq_1.10.0.zip'
Content type 'application/zip' length 751702 bytes (734 Kb)
opened URL
downloaded 734 Kb

package 'goseq' successfully unpacked and MD5 sums checked

> library(goseq)
Loading required package: BiasedUrn
Loading required package: geneLenDataBase
Error in loadNamespace(i[[1L]], c(lib.loc, .libPaths())) :
  there is no package called 'Biobase'
Error: package 'geneLenDataBase' could not be loaded

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252  LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C  LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] BiasedUrn_1.04 BiocInstaller_1.8.3

loaded via a namespace (and not attached):
[1] BiocGenerics_0.4.0 Biostrings_2.26.2 bitops_1.0-5 BSgenome_1.26.1 DBI_0.2-5 GenomicRanges_1.10.5 IRanges_1.16.4
[8] parallel_2.15.2 RCurl_1.95-3 Rsamtools_1.10.2 RSQLite_0.11.2 rtracklayer_1.18.1 stats4_2.15.2 tools_2.15.2
[15] XML_3.95-0.1 zlibbioc_1.4.0

Thanks,
Al

ADD COMMENT • link 13.0 years ago Alpesh Querer ▴ 220

On 12/03/2012 09:05 AM, Alpesh Querer wrote:
> I was trying to install the latest version of R and goseq, but it wouldn't
> load anymore.
[snip]
> > library(goseq)
> Loading required package: BiasedUrn
> Loading required package: geneLenDataBase
> Error in loadNamespace(i[[1L]], c(lib.loc, .libPaths())) :
>   there is no package called 'Biobase'
> Error: package 'geneLenDataBase' could not be loaded

geneLenDataBase (?) seems to be missing a dependency. Try biocLite("Biobase") first.

Martin
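
A small sketch for checking which packages are missing before calling library(). The helper name `missing_packages` is hypothetical. On current Bioconductor releases BiocManager::install() has replaced biocLite(); the install call is left commented out so that running the snippet does not install anything.

```r
# Hypothetical helper: return the packages from pkgs that cannot be loaded.
missing_packages <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

# Packages named in the error messages above, plus goseq itself
needed <- c("Biobase", "geneLenDataBase", "goseq")
to_install <- missing_packages(needed)
print(to_install)

# Install whatever is missing (needs the BiocManager package from CRAN):
# if (length(to_install) > 0) BiocManager::install(to_install)
```

If `to_install` comes back empty and library(goseq) still fails, the error message itself is the next place to look, since it names the package that could not be found.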

ADD REPLY • link 13.0 years ago Martin Morgan 25k

Gordon Smyth 53k

@gordon-smyth

WEHI, Melbourne, Australia

Dear Alpesh,

Please keep questions on the Bioconductor mailing list.

The error message says "there is no package called Biobase", which tells you that Biobase is required but you haven't installed it.

Best wishes
Gordon

ADD COMMENT • link 13.0 years ago • updated 3.7 years ago Gordon Smyth 53k
