Question: Simple pathway enrichment analysis for gene lists
0
5.5 years ago by
enricoferrero570
Switzerland
enricoferrero570 wrote:
Dear list, Can anybody suggest how to perform a simple pathway enrichment analysis starting from a list of gene IDs? I know about the gage and ROntoTools packages that use KEGGREST to retrieve an up to date version of the KEGG database, but, as far as I understand, they require a microarray experiment as input (or at least fold changes and pvalues). Since this time around I'm not starting from a microarray experiment but I just have a gene list, I'm looking for a way to perform pathway enrichment analysis using a simple numerical method such as Fisher's / hypergeometric test. I know the Category package still provides a KEGGHyperG class (which would be perfect!), but the results are based on the outdated version of KEGG (via KEGG.db, I guess). Are there any good alternatives available out there? Would it be possible to use reactome.db in conjunction with the Category/GOstats functions for example? Thank you! Best, -- Enrico Ferrero Department of Genetics Cambridge Systems Biology Centre University of Cambridge
modified 5.5 years ago by Paul Shannon370 • written 5.5 years ago by enricoferrero570
Answer: Simple pathway enrichment analysis for gene lists
0
5.5 years ago by
Paul Shannon370
Paul Shannon370 wrote:
Hi Enrico, reactome.db is best described, I believe, as simply a bioc rendering of Reactome's sql database -- Marc, please correct me if I am wrong. There is thus a data representation obstacle when using reactome.db: molecular relations are described, with pathway/gene mappings not so easy to get at. In addition, and despite Reactome's many strengths, its coverage is incomplete. The canonical wnt pathway, for instance, is (at my last check) not included. If you have a list of geneIDs, exploratory analysis can usefully start out with both GO enrichment, KEGG enrichment, and GSEA. Though the information in KEGG.db has not been updated in a couple of years, the information there is still very useful for exploratory data analysis. Any enrichments you discover using these assorted gene/ctaegory associations may lead you to a close study of particular functions or pathways, and it is this point that you may wish to get the latest and most specific information via KEGGREST and Reactome (and, with our next release) the new PSICQUIC package (see http://code.google.com/p/psicquic/). I hope this helps. Let us know if it falls short, or if new questions arise. - Paul On Sep 9, 2013, at 7:53 AM, Enrico Ferrero wrote: > Dear list, > > Can anybody suggest how to perform a simple pathway enrichment > analysis starting from a list of gene IDs? > > I know about the gage and ROntoTools packages that use KEGGREST to > retrieve an up to date version of the KEGG database, but, as far as I > understand, they require a microarray experiment as input (or at least > fold changes and pvalues). > > Since this time around I'm not starting from a microarray experiment > but I just have a gene list, I'm looking for a way to perform pathway > enrichment analysis using a simple numerical method such as Fisher's / > hypergeometric test. > > I know the Category package still provides a KEGGHyperG class (which > would be perfect!), but the results are based on the outdated version > of KEGG (via KEGG.db, I guess). > > Are there any good alternatives available out there? Would it be > possible to use reactome.db in conjunction with the Category/GOstats > functions for example? > > Thank you! > Best, > > -- > Enrico Ferrero > Department of Genetics > Cambridge Systems Biology Centre > University of Cambridge > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
On 09/09/2013 08:25 AM, Paul Shannon wrote: > Hi Enrico, > > reactome.db is best described, I believe, as simply a bioc rendering of Reactome's sql database -- Marc, please correct me if I am wrong. Basically yes, but there is still information in there for connecting gene IDs to reactome pathway IDs. And we put some effort into making this easier to use than it would be by default. So for example if you have an entrez gene ID "10898" you can look it up like this: select(reactome.db, 10898, "PATHID", "ENTREZID") And yes it is definitely incomplete in terms of pathway representation. And it is probably important to bear in mind that every large pathway database should be expected to be terribly incomplete. In fact if you compare them you will see that different databases overlap a lot less than you might have hoped. And even if they overlapped perfectly, they would still be woefully incomplete simply because our knowledge of pathways is far from complete. So you should be careful to never assume (for example) that what is in the database is complete, and to be honest even assuming that what is there is somehow "representative" may be a stretch in this case. Marc > There is thus a data representation obstacle when using reactome.db: molecular relations are described, with pathway/gene mappings not so easy to get at. > In addition, and despite Reactome's many strengths, its coverage is incomplete. The canonical wnt pathway, for instance, is (at my last check) not included. > > If you have a list of geneIDs, exploratory analysis can usefully start out with both GO enrichment, KEGG enrichment, and GSEA. Though the information in KEGG.db has not been updated in a couple of years, the information there is still very useful for exploratory data analysis. Any enrichments you discover using these assorted gene/ctaegory associations may lead you to a close study of particular functions or pathways, and it is this point that you may wish to get the latest and most specific information via KEGGREST and Reactome (and, with our next release) the new PSICQUIC package (see http://code.google.com/p/psicquic/). > > I hope this helps. Let us know if it falls short, or if new questions arise. > > - Paul > > > On Sep 9, 2013, at 7:53 AM, Enrico Ferrero wrote: > >> Dear list, >> >> Can anybody suggest how to perform a simple pathway enrichment >> analysis starting from a list of gene IDs? >> >> I know about the gage and ROntoTools packages that use KEGGREST to >> retrieve an up to date version of the KEGG database, but, as far as I >> understand, they require a microarray experiment as input (or at least >> fold changes and pvalues). >> >> Since this time around I'm not starting from a microarray experiment >> but I just have a gene list, I'm looking for a way to perform pathway >> enrichment analysis using a simple numerical method such as Fisher's / >> hypergeometric test. >> >> I know the Category package still provides a KEGGHyperG class (which >> would be perfect!), but the results are based on the outdated version >> of KEGG (via KEGG.db, I guess). >> >> Are there any good alternatives available out there? Would it be >> possible to use reactome.db in conjunction with the Category/GOstats >> functions for example? >> >> Thank you! >> Best, >> >> -- >> Enrico Ferrero >> Department of Genetics >> Cambridge Systems Biology Centre >> University of Cambridge >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
Hi Enrico, You ask good questions. In answer to your immediate need, pasted in below are two versions of a short KEGG pathway enrichment script. The first emphasizes simplicity and clarity, and uses only the stock R "dhyper" function along with KEGGREST. The second version (courtesy of Martin, who is also responsible for some of the approaches I use in my simpler version) produces a richer data.frame result. Inasmuch as these examples use KEGGREST, you benefit from the current curated data at KEGG. As you point out, KEGG.db will only become increasingly out of date. KEGGREST gets gene/pathway memberships dynamically, avoiding the KEGG prohibition on full downloads. We offer these two KEGGREST:dhypher/hyperg examples informally, and in hopes of encouraging discussion about what Bioconductor should offer more formally. It would be illuminating to run the same demo using pathways from Reactome. Perhaps you could contribute such a demo? Please let us know what you think. - Paul ----- simpler version library(KEGGREST) library(org.Hs.eg.db) # created named list, eg: path:map00010: "Glycolysis / Gluconeogenesis" pathways.list <- head(keggList("pathway", "hsa"), n=10) # make them into KEGG-style human pathway identifiers pathway.codes <- sub("path:", "", names(pathways.list)) # for demonstration, just use the first ten pathways # not all pathways exist for human, so TODO: tryCatch the # keggGet to be robust against those failures # subsetting by c(TRUE, FALSE) -- which repeats # as many times as needed, sorts through some # unexpected packaging of geneIDs in the GENE element # of each pw[[n]] genes.by.pathway <- sapply(pathway.codes, function(pwid){ pw <- keggGet(pwid) pw[[1]]$GENE[c(TRUE, FALSE)] }) all.geneIDs <- keys(org.Hs.eg.db) # chose one of these for demonstration. the first (a whole genome random # set of 100 genes) has very little enrichment, the second, a random set # from the pathways themsevles, has very good enrichment genes.of.interest <- c("23118", "23119", "23304", "25998", "26001", "51043", "55632", "55643", "55743", "55870", "7314", "56254", "7316", "144193","784", "8837", "1111", "84706", "200931","169522","5707", "5091", "5901", "55532", "9777") # the hypergeometric distribution is traditionally explained in terms of # drawing a sample of balls from an urn containing black and white balls. # to keep the arguments straight (in my mind at least), I use these terms # here also pVals.by.pathway <- sapply(names(genes.by.pathway), function(pathway) { pathway.genes <- genes.by.pathway[[pathway]] white.balls.drawn <- length(intersect(genes.of.interest, pathway.genes)) white.balls.in.urn <- length(pathway.genes) total.balls.in.urn <- length(all.geneIDs) black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn total.balls.drawn.from.urn <- length(genes.of.interest) dhyper(white.balls.drawn, white.balls.in.urn, black.balls.in.urn, total.balls.drawn.from.urn) }) print(pVals.by.pathway) ---- richer version library(KEGGREST) library(org.Hs.eg.db) # created named list, length 449, eg: # path:hsa00010: "Glycolysis / Gluconeogenesis" pathways <- keggList("pathway", "hsa") # make them into KEGG-style human pathway identifiers human.pathways <- sub("path:", "", names(pathways)) # for demonstration, just use the first ten pathways demo.pathway.ids <- head(human.pathways, 10) demo.pathways <- setNames(keggGet(demo.pathway.ids), demo.pathway.ids) genes.by.pathway <- lapply(demo.pathways, function(demo.pathway) { demo.pathway$GENE[c(TRUE, FALSE)] }) all.geneIDs <- keys(org.Hs.eg.db) # chose one of these for demonstration. the first (a whole genome random # set of 100 genes) has very little enrichment, the second, a random set # from the pathways themselves, has very good enrichment in some pathways set.seed(123) significant.genes <- sample(all.geneIDs, size=100) #significant.genes <- sample(unique(unlist(genes.by.pathway)), size=10) # the hypergeometric distribution is traditionally explained in terms of # drawing a sample of balls from an urn containing black and white balls. # to keep the arguments straight (in my mind at least), I use these terms # here also hyperg <- Category:::.doHyperGInternal hyperg.test <- function(pathway.genes, significant.genes, all.genes, over=TRUE) { white.balls.drawn <- length(intersect(significant.genes, pathway.genes)) white.balls.in.urn <- length(pathway.genes) total.balls.in.urn <- length(all.genes) black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn balls.pulled.from.urn <- length(significant.genes) hyperg(white.balls.in.urn, black.balls.in.urn, balls.pulled.from.urn, white.balls.drawn, over) } pVals.by.pathway <- t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs)) print(pVals.by.pathway) On Sep 10, 2013, at 9:23 AM, Enrico Ferrero wrote: > Hi Paul, > > Thanks for your suggestions. > I already performed a GO enrichment analysis and I was specifically > looking at the enrichment of pathways. You also suggest GSEA, but, > correct me if I'm wrong, I think I would need some additional value to > rank my list of gene IDs, as the GSEA algorithm requires a ranked list > to start with. > I guess at the moment my best option is to use Category and the > outdated KEGG.db to have an idea of the enriched pathways and then use > some non-Bioconductor tool (any suggestion?). > > I'm not sure this is the right place, but, on a more general note, may > I ask if there are any plans to provide better and more integrated > support for pathway enrichment analysis in Bioconductor? > For example, would it be possible to build something similar to an up > to date KEGG.db using KEGG's REST API? After all, packages such as > gage and ROntoTools get their information from KEGG in that way. > Alternatively, while not as comprehensive as KEGG, Reactome is > completely open source and open access and I would be surprised if > there wasn't a more fruitful way to integrate it with the core > Bioconductor tools an packages. > > I'm probably being naive here, but I'm afraid I don't fully understand > what is the general consensus of the Bioconductor community on current > status and future directions of pathway analysis tools. > > Thank you. > Best, > > On 9 September 2013 16:25, Paul Shannon <paul.thurmond.shannon at="" gmail.com=""> wrote: >> Hi Enrico, >> >> reactome.db is best described, I believe, as simply a bioc rendering of Reactome's sql database -- Marc, please correct me if I am wrong. >> >> There is thus a data representation obstacle when using reactome.db: molecular relations are described, with pathway/gene mappings not so easy to get at. >> In addition, and despite Reactome's many strengths, its coverage is incomplete. The canonical wnt pathway, for instance, is (at my last check) not included. >> >> If you have a list of geneIDs, exploratory analysis can usefully start out with both GO enrichment, KEGG enrichment, and GSEA. Though the information in KEGG.db has not been updated in a couple of years, the information there is still very useful for exploratory data analysis. Any enrichments you discover using these assorted gene/ctaegory associations may lead you to a close study of particular functions or pathways, and it is this point that you may wish to get the latest and most specific information via KEGGREST and Reactome (and, with our next release) the new PSICQUIC package (see http://code.google.com/p/psicquic/). >> >> I hope this helps. Let us know if it falls short, or if new questions arise. >> >> - Paul >> >> >> On Sep 9, 2013, at 7:53 AM, Enrico Ferrero wrote: >> >>> Dear list, >>> >>> Can anybody suggest how to perform a simple pathway enrichment >>> analysis starting from a list of gene IDs? >>> >>> I know about the gage and ROntoTools packages that use KEGGREST to >>> retrieve an up to date version of the KEGG database, but, as far as I >>> understand, they require a microarray experiment as input (or at least >>> fold changes and pvalues). >>> >>> Since this time around I'm not starting from a microarray experiment >>> but I just have a gene list, I'm looking for a way to perform pathway >>> enrichment analysis using a simple numerical method such as Fisher's / >>> hypergeometric test. >>> >>> I know the Category package still provides a KEGGHyperG class (which >>> would be perfect!), but the results are based on the outdated version >>> of KEGG (via KEGG.db, I guess). >>> >>> Are there any good alternatives available out there? Would it be >>> possible to use reactome.db in conjunction with the Category/GOstats >>> functions for example? >>> >>> Thank you! >>> Best, >>> >>> -- >>> Enrico Ferrero >>> Department of Genetics >>> Cambridge Systems Biology Centre >>> University of Cambridge >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Enrico Ferrero > Department of Genetics > Cambridge Systems Biology Centre > University of Cambridge
One could also use KEGG pathways from MSigDB (or any other gene set collection; http://www.broadinstitute.org/gsea/msigdb/collections.jsp registration required and provenance unclear [to me, on quick look]) with > library(GSEABase) > sets <- getGmt("~/Downloads/c2.cp.kegg.v4.0.entrez.gmt") > genes.by.pathway <- lapply(sets, geneIds) > names(genes.by.pathway) <- names(sets) and then as below. Martin On 09/11/2013 09:05 AM, Paul Shannon wrote: > Hi Enrico, > > You ask good questions. > > In answer to your immediate need, pasted in below are two versions of a short KEGG pathway enrichment script. > The first emphasizes simplicity and clarity, and uses only the stock R "dhyper" function along with KEGGREST. > The second version (courtesy of Martin, who is also responsible for some of the approaches I use in my simpler version) > produces a richer data.frame result. > > Inasmuch as these examples use KEGGREST, you benefit from the current curated data at KEGG. As you point out, KEGG.db will only become increasingly out of date. KEGGREST gets gene/pathway memberships dynamically, avoiding the KEGG prohibition on full downloads. > > We offer these two KEGGREST:dhypher/hyperg examples informally, and in hopes of encouraging discussion about what Bioconductor should offer more formally. > > It would be illuminating to run the same demo using pathways from Reactome. Perhaps you could contribute such a demo? > > Please let us know what you think. > > - Paul > > ----- simpler version > library(KEGGREST) > library(org.Hs.eg.db) > > # created named list, eg: path:map00010: "Glycolysis / Gluconeogenesis" > > pathways.list <- head(keggList("pathway", "hsa"), n=10) > > # make them into KEGG-style human pathway identifiers > pathway.codes <- sub("path:", "", names(pathways.list)) > > # for demonstration, just use the first ten pathways > # not all pathways exist for human, so TODO: tryCatch the > # keggGet to be robust against those failures > > # subsetting by c(TRUE, FALSE) -- which repeats > # as many times as needed, sorts through some > # unexpected packaging of geneIDs in the GENE element > # of each pw[[n]] > > genes.by.pathway <- sapply(pathway.codes, > function(pwid){ > pw <- keggGet(pwid) > pw[[1]]$GENE[c(TRUE, FALSE)] > }) > > all.geneIDs <- keys(org.Hs.eg.db) > > # chose one of these for demonstration. the first (a whole genome random > # set of 100 genes) has very little enrichment, the second, a random set > # from the pathways themsevles, has very good enrichment > > genes.of.interest <- c("23118", "23119", "23304", "25998", "26001", "51043", > "55632", "55643", "55743", "55870", "7314", "56254", > "7316", "144193","784", "8837", "1111", "84706", > "200931","169522","5707", "5091", "5901", "55532", > "9777") > > # the hypergeometric distribution is traditionally explained in terms of > # drawing a sample of balls from an urn containing black and white balls. > # to keep the arguments straight (in my mind at least), I use these terms > # here also > > pVals.by.pathway <- sapply(names(genes.by.pathway), > function(pathway) { > pathway.genes <- genes.by.pathway[[pathway]] > white.balls.drawn <- length(intersect(genes.of.interest, pathway.genes)) > white.balls.in.urn <- length(pathway.genes) > total.balls.in.urn <- length(all.geneIDs) > black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn > total.balls.drawn.from.urn <- length(genes.of.interest) > dhyper(white.balls.drawn, > white.balls.in.urn, > black.balls.in.urn, > total.balls.drawn.from.urn) > }) > > print(pVals.by.pathway) > > ---- richer version > > library(KEGGREST) > library(org.Hs.eg.db) > > # created named list, length 449, eg: > # path:hsa00010: "Glycolysis / Gluconeogenesis" > > pathways <- keggList("pathway", "hsa") > > # make them into KEGG-style human pathway identifiers > human.pathways <- sub("path:", "", names(pathways)) > > # for demonstration, just use the first ten pathways > > demo.pathway.ids <- head(human.pathways, 10) > demo.pathways <- setNames(keggGet(demo.pathway.ids), demo.pathway.ids) > > genes.by.pathway <- lapply(demo.pathways, function(demo.pathway) { > demo.pathway$GENE[c(TRUE, FALSE)] > }) > > all.geneIDs <- keys(org.Hs.eg.db) > > # chose one of these for demonstration. the first (a whole genome random > # set of 100 genes) has very little enrichment, the second, a random set > # from the pathways themselves, has very good enrichment in some pathways > > set.seed(123) > significant.genes <- sample(all.geneIDs, size=100) > #significant.genes <- sample(unique(unlist(genes.by.pathway)), size=10) > > # the hypergeometric distribution is traditionally explained in terms of > # drawing a sample of balls from an urn containing black and white balls. > # to keep the arguments straight (in my mind at least), I use these terms > # here also > > hyperg <- Category:::.doHyperGInternal > hyperg.test <- > function(pathway.genes, significant.genes, all.genes, over=TRUE) > { > white.balls.drawn <- length(intersect(significant.genes, pathway.genes)) > white.balls.in.urn <- length(pathway.genes) > total.balls.in.urn <- length(all.genes) > black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn > balls.pulled.from.urn <- length(significant.genes) > hyperg(white.balls.in.urn, black.balls.in.urn, > balls.pulled.from.urn, white.balls.drawn, over) > } > > pVals.by.pathway <- > t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs)) > > print(pVals.by.pathway) > > > On Sep 10, 2013, at 9:23 AM, Enrico Ferrero wrote: > >> Hi Paul, >> >> Thanks for your suggestions. >> I already performed a GO enrichment analysis and I was specifically >> looking at the enrichment of pathways. You also suggest GSEA, but, >> correct me if I'm wrong, I think I would need some additional value to >> rank my list of gene IDs, as the GSEA algorithm requires a ranked list >> to start with. >> I guess at the moment my best option is to use Category and the >> outdated KEGG.db to have an idea of the enriched pathways and then use >> some non-Bioconductor tool (any suggestion?). >> >> I'm not sure this is the right place, but, on a more general note, may >> I ask if there are any plans to provide better and more integrated >> support for pathway enrichment analysis in Bioconductor? >> For example, would it be possible to build something similar to an up >> to date KEGG.db using KEGG's REST API? After all, packages such as >> gage and ROntoTools get their information from KEGG in that way. >> Alternatively, while not as comprehensive as KEGG, Reactome is >> completely open source and open access and I would be surprised if >> there wasn't a more fruitful way to integrate it with the core >> Bioconductor tools an packages. >> >> I'm probably being naive here, but I'm afraid I don't fully understand >> what is the general consensus of the Bioconductor community on current >> status and future directions of pathway analysis tools. >> >> Thank you. >> Best, >> >> On 9 September 2013 16:25, Paul Shannon <paul.thurmond.shannon at="" gmail.com=""> wrote: >>> Hi Enrico, >>> >>> reactome.db is best described, I believe, as simply a bioc rendering of Reactome's sql database -- Marc, please correct me if I am wrong. >>> >>> There is thus a data representation obstacle when using reactome.db: molecular relations are described, with pathway/gene mappings not so easy to get at. >>> In addition, and despite Reactome's many strengths, its coverage is incomplete. The canonical wnt pathway, for instance, is (at my last check) not included. >>> >>> If you have a list of geneIDs, exploratory analysis can usefully start out with both GO enrichment, KEGG enrichment, and GSEA. Though the information in KEGG.db has not been updated in a couple of years, the information there is still very useful for exploratory data analysis. Any enrichments you discover using these assorted gene/ctaegory associations may lead you to a close study of particular functions or pathways, and it is this point that you may wish to get the latest and most specific information via KEGGREST and Reactome (and, with our next release) the new PSICQUIC package (see http://code.google.com/p/psicquic/). >>> >>> I hope this helps. Let us know if it falls short, or if new questions arise. >>> >>> - Paul >>> >>> >>> On Sep 9, 2013, at 7:53 AM, Enrico Ferrero wrote: >>> >>>> Dear list, >>>> >>>> Can anybody suggest how to perform a simple pathway enrichment >>>> analysis starting from a list of gene IDs? >>>> >>>> I know about the gage and ROntoTools packages that use KEGGREST to >>>> retrieve an up to date version of the KEGG database, but, as far as I >>>> understand, they require a microarray experiment as input (or at least >>>> fold changes and pvalues). >>>> >>>> Since this time around I'm not starting from a microarray experiment >>>> but I just have a gene list, I'm looking for a way to perform pathway >>>> enrichment analysis using a simple numerical method such as Fisher's / >>>> hypergeometric test. >>>> >>>> I know the Category package still provides a KEGGHyperG class (which >>>> would be perfect!), but the results are based on the outdated version >>>> of KEGG (via KEGG.db, I guess). >>>> >>>> Are there any good alternatives available out there? Would it be >>>> possible to use reactome.db in conjunction with the Category/GOstats >>>> functions for example? >>>> >>>> Thank you! >>>> Best, >>>> >>>> -- >>>> Enrico Ferrero >>>> Department of Genetics >>>> Cambridge Systems Biology Centre >>>> University of Cambridge >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> -- >> Enrico Ferrero >> Department of Genetics >> Cambridge Systems Biology Centre >> University of Cambridge > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
Hi Paul, Thank you very much, your response is helpful and full of insights. Personally, I think it'd be great if a similar script could be wrapped into a function and included into a new or existing package, but I'd like to hear what other people think. For Reactome, I think a good alternative could be using the ReactomePA package which does just that, I don't know how I didn't notice it earlier. Another resource which in my opinion should be considered for future applications is PathwayCommons [1] which is freely accessible and already gathers a considerable set of pathways from different sources. It's basically what Cytoscape uses for pathway retrieval so it should have a programmer-friendly access interface. Best, [1] http://www.pathwaycommons.org On 11 September 2013 18:00, Martin Morgan <mtmorgan@fhcrc.org> wrote: > One could also use KEGG pathways from MSigDB (or any other gene set > collection; > > http://www.broadinstitute.org/**gsea/msigdb/collections.jsp<http: www.broadinstitute.org="" gsea="" msigdb="" collections.jsp=""> > > registration required and provenance unclear [to me, on quick look]) with > > > library(GSEABase) > > sets <- getGmt("~/Downloads/c2.cp.**kegg.v4.0.entrez.gmt") > > genes.by.pathway <- lapply(sets, geneIds) > > names(genes.by.pathway) <- names(sets) > > and then as below. > > Martin > > > On 09/11/2013 09:05 AM, Paul Shannon wrote: > >> Hi Enrico, >> >> You ask good questions. >> >> In answer to your immediate need, pasted in below are two versions of a >> short KEGG pathway enrichment script. >> The first emphasizes simplicity and clarity, and uses only the stock R >> "dhyper" function along with KEGGREST. >> The second version (courtesy of Martin, who is also responsible for some >> of the approaches I use in my simpler version) >> produces a richer data.frame result. >> >> Inasmuch as these examples use KEGGREST, you benefit from the current >> curated data at KEGG. As you point out, KEGG.db will only become >> increasingly out of date. KEGGREST gets gene/pathway memberships >> dynamically, avoiding the KEGG prohibition on full downloads. >> >> We offer these two KEGGREST:dhypher/hyperg examples informally, and in >> hopes of encouraging discussion about what Bioconductor should offer more >> formally. >> >> It would be illuminating to run the same demo using pathways from >> Reactome. Perhaps you could contribute such a demo? >> >> Please let us know what you think. >> >> - Paul >> >> ----- simpler version >> library(KEGGREST) >> library(org.Hs.eg.db) >> >> # created named list, eg: path:map00010: "Glycolysis / >> Gluconeogenesis" >> >> pathways.list <- head(keggList("pathway", "hsa"), n=10) >> >> # make them into KEGG-style human pathway identifiers >> pathway.codes <- sub("path:", "", names(pathways.list)) >> >> # for demonstration, just use the first ten pathways >> # not all pathways exist for human, so TODO: tryCatch the >> # keggGet to be robust against those failures >> >> # subsetting by c(TRUE, FALSE) -- which repeats >> # as many times as needed, sorts through some >> # unexpected packaging of geneIDs in the GENE element >> # of each pw[[n]] >> >> genes.by.pathway <- sapply(pathway.codes, >> function(pwid){ >> pw <- keggGet(pwid) >> pw[[1]]$GENE[c(TRUE, FALSE)] >> }) >> >> all.geneIDs <- keys(org.Hs.eg.db) >> >> # chose one of these for demonstration. the first (a whole genome >> random >> # set of 100 genes) has very little enrichment, the second, a random >> set >> # from the pathways themsevles, has very good enrichment >> >> genes.of.interest <- c("23118", "23119", "23304", "25998", "26001", >> "51043", >> "55632", "55643", "55743", "55870", "7314", >> "56254", >> "7316", "144193","784", "8837", "1111", >> "84706", >> "200931","169522","5707", "5091", "5901", >> "55532", >> "9777") >> >> # the hypergeometric distribution is traditionally explained in terms >> of >> # drawing a sample of balls from an urn containing black and white >> balls. >> # to keep the arguments straight (in my mind at least), I use these >> terms >> # here also >> >> pVals.by.pathway <- sapply(names(genes.by.pathway)**, >> function(pathway) { >> pathway.genes <- genes.by.pathway[[pathway]] >> white.balls.drawn <- >> length(intersect(genes.of.**interest, pathway.genes)) >> white.balls.in.urn <- length(pathway.genes) >> total.balls.in.urn <- length(all.geneIDs) >> black.balls.in.urn <- total.balls.in.urn - >> white.balls.in.urn >> total.balls.drawn.from.urn <- >> length(genes.of.interest) >> dhyper(white.balls.drawn, >> white.balls.in.urn, >> black.balls.in.urn, >> total.balls.drawn.from.urn) >> }) >> >> print(pVals.by.pathway) >> >> ---- richer version >> >> library(KEGGREST) >> library(org.Hs.eg.db) >> >> # created named list, length 449, eg: >> # path:hsa00010: "Glycolysis / Gluconeogenesis" >> >> pathways <- keggList("pathway", "hsa") >> >> # make them into KEGG-style human pathway identifiers >> human.pathways <- sub("path:", "", names(pathways)) >> >> # for demonstration, just use the first ten pathways >> >> demo.pathway.ids <- head(human.pathways, 10) >> demo.pathways <- setNames(keggGet(demo.pathway.**ids), demo.pathway.ids) >> >> genes.by.pathway <- lapply(demo.pathways, function(demo.pathway) { >> demo.pathway$GENE[c(TRUE, FALSE)] >> }) >> >> all.geneIDs <- keys(org.Hs.eg.db) >> >> # chose one of these for demonstration. the first (a whole genome >> random >> # set of 100 genes) has very little enrichment, the second, a random >> set >> # from the pathways themselves, has very good enrichment in some >> pathways >> >> set.seed(123) >> significant.genes <- sample(all.geneIDs, size=100) >> #significant.genes <- sample(unique(unlistgenes.by.**pathway)), size=10) >> >> # the hypergeometric distribution is traditionally explained in terms >> of >> # drawing a sample of balls from an urn containing black and white >> balls. >> # to keep the arguments straight (in my mind at least), I use these >> terms >> # here also >> >> hyperg <- Category:::.doHyperGInternal >> hyperg.test <- >> function(pathway.genes, significant.genes, all.genes, over=TRUE) >> { >> white.balls.drawn <- length(intersect(significant.**genes, >> pathway.genes)) >> white.balls.in.urn <- length(pathway.genes) >> total.balls.in.urn <- length(all.genes) >> black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn >> balls.pulled.from.urn <- length(significant.genes) >> hyperg(white.balls.in.urn, black.balls.in.urn, >> balls.pulled.from.urn, white.balls.drawn, over) >> } >> >> pVals.by.pathway <- >> t(sapply(genes.by.pathway, hyperg.test, significant.genes, >> all.geneIDs)) >> >> print(pVals.by.pathway) >> >> >> On Sep 10, 2013, at 9:23 AM, Enrico Ferrero wrote: >> >> Hi Paul, >>> >>> Thanks for your suggestions. >>> I already performed a GO enrichment analysis and I was specifically >>> looking at the enrichment of pathways. You also suggest GSEA, but, >>> correct me if I'm wrong, I think I would need some additional value to >>> rank my list of gene IDs, as the GSEA algorithm requires a ranked list >>> to start with. >>> I guess at the moment my best option is to use Category and the >>> outdated KEGG.db to have an idea of the enriched pathways and then use >>> some non-Bioconductor tool (any suggestion?). >>> >>> I'm not sure this is the right place, but, on a more general note, may >>> I ask if there are any plans to provide better and more integrated >>> support for pathway enrichment analysis in Bioconductor? >>> For example, would it be possible to build something similar to an up >>> to date KEGG.db using KEGG's REST API? After all, packages such as >>> gage and ROntoTools get their information from KEGG in that way. >>> Alternatively, while not as comprehensive as KEGG, Reactome is >>> completely open source and open access and I would be surprised if >>> there wasn't a more fruitful way to integrate it with the core >>> Bioconductor tools an packages. >>> >>> I'm probably being naive here, but I'm afraid I don't fully understand >>> what is the general consensus of the Bioconductor community on current >>> status and future directions of pathway analysis tools. >>> >>> Thank you. >>> Best, >>> >>> On 9 September 2013 16:25, Paul Shannon <paul.thurmond.shannon@gmail.**>>> com <paul.thurmond.shannon@gmail.com>> wrote: >>> >>>> Hi Enrico, >>>> >>>> reactome.db is best described, I believe, as simply a bioc rendering of >>>> Reactome's sql database -- Marc, please correct me if I am wrong. >>>> >>>> There is thus a data representation obstacle when using reactome.db: >>>> molecular relations are described, with pathway/gene mappings not so easy >>>> to get at. >>>> In addition, and despite Reactome's many strengths, its coverage is >>>> incomplete. The canonical wnt pathway, for instance, is (at my last check) >>>> not included. >>>> >>>> If you have a list of geneIDs, exploratory analysis can usefully start >>>> out with both GO enrichment, KEGG enrichment, and GSEA. Though the >>>> information in KEGG.db has not been updated in a couple of years, the >>>> information there is still very useful for exploratory data analysis. Any >>>> enrichments you discover using these assorted gene/ctaegory associations >>>> may lead you to a close study of particular functions or pathways, and it >>>> is this point that you may wish to get the latest and most specific >>>> information via KEGGREST and Reactome (and, with our next release) the new >>>> PSICQUIC package (see http://code.google.com/p/**psicquic/<http: code.google.com="" p="" psicquic=""/> >>>> ). >>>> >>>> I hope this helps. Let us know if it falls short, or if new questions >>>> arise. >>>> >>>> - Paul >>>> >>>> >>>> On Sep 9, 2013, at 7:53 AM, Enrico Ferrero wrote: >>>> >>>> Dear list, >>>>> >>>>> Can anybody suggest how to perform a simple pathway enrichment >>>>> analysis starting from a list of gene IDs? >>>>> >>>>> I know about the gage and ROntoTools packages that use KEGGREST to >>>>> retrieve an up to date version of the KEGG database, but, as far as I >>>>> understand, they require a microarray experiment as input (or at least >>>>> fold changes and pvalues). >>>>> >>>>> Since this time around I'm not starting from a microarray experiment >>>>> but I just have a gene list, I'm looking for a way to perform pathway >>>>> enrichment analysis using a simple numerical method such as Fisher's / >>>>> hypergeometric test. >>>>> >>>>> I know the Category package still provides a KEGGHyperG class (which >>>>> would be perfect!), but the results are based on the outdated version >>>>> of KEGG (via KEGG.db, I guess). >>>>> >>>>> Are there any good alternatives available out there? Would it be >>>>> possible to use reactome.db in conjunction with the Category/GOstats >>>>> functions for example? >>>>> >>>>> Thank you! >>>>> Best, >>>>> >>>>> -- >>>>> Enrico Ferrero >>>>> Department of Genetics >>>>> Cambridge Systems Biology Centre >>>>> University of Cambridge >>>>> >>>>> ______________________________**_________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@r-project.org >>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: sta="" t.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>> Search the archives: http://news.gmane.org/gmane.** >>>>> science.biology.informatics.**conductor<http: news.gmane.org="" gm="" ane.science.biology.informatics.conductor=""> >>>>> >>>> >>>> >>> -- >>> Enrico Ferrero >>> Department of Genetics >>> Cambridge Systems Biology Centre >>> University of Cambridge >>> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > -- Enrico Ferrero Department of Genetics Cambridge Systems Biology Centre University of Cambridge [[alternative HTML version deleted]]

Dear Paul - I know a while has elapsed since your post, but I'm trying it out and am stuck at one point. I think the issue I have is inexperience with R/coding rather than a specific problem with this method. Bearing this in mind, would you be able to help?

library(KEGGREST)

install.packages("org.Hs.eg.db")
biocLite("org.Hs.eg.db")
library(org.Hs.eg.db)

...and that seems to have gone fine. I've made the Pathways.list and pathway.codes objects without any trouble too.

However, it's at the point of the 'simple' script where you upload your genes of interest into the command:

# subsetting by c(TRUE, FALSE) -- which repeats
# as many times as needed, sorts through some
# unexpected packaging of geneIDs in the GENE element
# of each pw[[n]]

genes.by.pathway <- sapply(pathway.codes, function(pwid){ pw <- keggGet(pwid) pw[[1]]$GENE[c(TRUE, FALSE)] }) ...so I replace the term '$GENE' with my vector of Gene IDs (Entrez IDs), - in bold below "res.PCA3.Sig.ENTREZ" and the code fails.

> genes.by.pathway <- sapply(pathway.codes,
+                            function(pwid){
+                              pw <- keggGet(pwid)
+                              pw[[1]]res.PCA3.Sig.ENTREZ [c(TRUE, FALSE)]
Error: unexpected symbol in:
"                             pw <- keggGet(pwid)
pw[[1]]res.PCA3.Sig.ENTREZ"
>                            })
Error: unexpected '}' in "                     

I presume my mistake is fairly basic....and hopefully easy to rectify. Something to do with the dollar sign? Or do I need to define functions separately before running this part?

Thanks for any insight,

Matt

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.