GO classification

Ganiraju ▴ 10

Hi,

I got a set of significant GO ids by running my data through GOhypergtest of GOstats. Now I'm trying to classify these GO ids using the GO_slim classifier. Is there any package in R which can accomplish this job?

Thanks,
Gani

ADD COMMENT • link updated 16.5 years ago by Martin Morgan 25k • written 16.6 years ago by Ganiraju ▴ 10

Marc Carlson ★ 7.2k

Ganiraju wrote:
> hi,
>
> I got a set of significant GOids by running my data using GOhypergtest of
> Gostats. Now im trying to classify these GO ids using the GO_slim
> classifier. Is there any package in R which can accomplish this job??
>
> Thanks
> Gani

We certainly have GO annotation packages. Could you please go into more detail about what it is that you are trying to do?

Marc

ADD COMMENT • link 16.6 years ago Marc Carlson ★ 7.2k

Steffen ▴ 500

Hi Ganiraju,

I don't think there is an R package for mapping a set of GO identifiers to GO slim, but it should be possible to write R code that does this mapping by using the GO annotation packages. Alternatively, you could use the map2slim.pl Perl script; see:
http://search.cpan.org/~cmungall/go-perl/scripts/map2slim

This script is available in the go-perl module at:
http://search.cpan.org/~cmungall/go-perl/

Cheers,
Steffen
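A minimal sketch of the R route, assuming the ontology is available as a table of direct parents. The ids, the `parents` list and the helpers `ancestors()` and `map_to_slim()` are hypothetical and only illustrate the idea; with real data the ancestor sets would come from a GO annotation package rather than a hand-written list.

```r
## Toy ontology: each term lists its direct parents.
## These ids are made up for illustration; they are not real GO terms.
parents <- list(
    "T:root"        = character(0),
    "T:binding"     = "T:root",
    "T:catalytic"   = "T:root",
    "T:dna_binding" = "T:binding",
    "T:kinase"      = "T:catalytic",
    "T:dna_kinase"  = c("T:dna_binding", "T:kinase")
)

## All ancestors of one term, found by walking the parent links
ancestors <- function(id, parents) {
    seen <- character(0)
    todo <- parents[[id]]
    while (length(todo) > 0) {
        cur <- todo[1]
        todo <- todo[-1]
        if (!(cur %in% seen)) {
            seen <- c(seen, cur)
            todo <- c(todo, parents[[cur]])
        }
    }
    seen
}

## Map each id to the slim terms that are the id itself or one of its ancestors
map_to_slim <- function(ids, slim, parents) {
    lapply(setNames(ids, ids), function(id) {
        intersect(slim, c(id, ancestors(id, parents)))
    })
}

slim <- c("T:binding", "T:catalytic")
res <- map_to_slim(c("T:dna_kinase", "T:kinase", "T:binding"), slim, parents)
print(res)
```

A term with parents in two branches ("T:dna_kinase" here) maps to two slim categories, so the result is a list of character vectors rather than a one-to-one lookup.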

ADD COMMENT • link 16.6 years ago Steffen ▴ 500

Celine Carret ▴ 220

Hi,

you should have a look at this:
http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/goSlim.txt

I found it very clear and useful, and if you don't work on human chips you can customise the script easily.

Best regards,
Celine

ADD COMMENT • link 16.5 years ago Celine Carret ▴ 220

GO_Slim categories are usually sets of very general custom GO terms. Examples of these GO_Slim sets can be downloaded from geneontology.org:
http://www.geneontology.org/GO.slims.shtml

When working with GOstats, you can simply use your favorite GO_Slim set for subsetting your enrichment analysis results and then plot the corresponding counts. Usually, GO_Slim representations, especially pie charts, pretend that each item (e.g. a gene) is assigned to only one category. That is typically not the case, since there are duplicates almost everywhere.

As a suggestion to the developers: considering the popularity of these GO_Slim representations, it might be useful to add some instructions to the GOstats PDF that show users how to generate counts and plots for their favorite GO_Slim categories, perhaps with a proper warning about the limitations of these analyses.

Best,
Thomas
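The pie-chart caveat can be made concrete with a small sketch. The gene names and category assignments below are invented for illustration; the point is only that per-category counts add up to more than the number of genes when genes belong to several categories.

```r
## Hypothetical gene-to-slim assignments; a gene may sit in several categories
gene2slim <- list(
    geneA = c("binding", "catalytic activity"),
    geneB = "binding",
    geneC = c("binding", "transporter activity", "catalytic activity"),
    geneD = "transporter activity"
)

## Number of genes assigned to each slim category
tab <- table(unlist(gene2slim, use.names = FALSE))
counts <- setNames(as.vector(tab), names(tab))

n_genes <- length(gene2slim)
n_assignments <- sum(counts)

## Share of genes in each category: can sum to more than 100%
pct_of_genes <- 100 * counts / n_genes
## Share of assignments: what a pie chart implicitly shows, sums to 100%
pct_of_assignments <- 100 * counts / n_assignments

print(cbind(Count = counts,
            PctOfGenes = pct_of_genes,
            PctOfAssignments = pct_of_assignments))
```

A pie chart drawn from `counts` shows the second percentage, which describes assignments rather than genes; reporting the first alongside it makes the overlap visible.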

ADD REPLY • link 16.5 years ago Thomas Girke ★ 1.7k

Martin Morgan 25k

Hi Thomas,

I like the idea of incorporating slims into an existing package (and have started to in the development version of GSEABase, version 1.1.5, when it appears). I've been playing around a bit and wanted to get some feedback from you and the list.

There's an immediately usable version of the underlying code at the end of the message (no devel version of GSEABase required). The code below is considerably faster than that posted earlier.

In GSEABase, the function getOBOCollection parses 'obo' files, as these seem to be how 'slim' sets are sometimes stored. With GSEABase, you'd do

    > fl <- "http://www.geneontology.org/GO_slims/goslim_plant.obo"
    > oboSlim <- getOBOCollection(fl)
    > oboSlim
    collectionType: GO
      ids: GO:0000003, GO:0000166, ..., GO:0045182 (106 total)
      evidenceCode: IMP IPI TAS ISS IDA NAS IEA IGI RCA IEP IC NR ND

'fl' could be a local file. You could also create an ad hoc collection with

    > slimTerms <- GOCollection(c("GO:0030246", "GO:0008289", "GO:0003676",
    +                             "GO:0000166"))

Use the collection of slim ids to classify your own ids, e.g.,

    > my_GOIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0004345",
    +               "GO:0004345", "GO:0004345", "GO:0004345", "GO:0008265",
    +               "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664",
    +               "GO:0006412", "GO:0006412", "GO:0006412", "GO:0007046",
    +               "GO:0015979", "GO:0006457", "GO:0008372", "GO:0005618",
    +               "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311",
    +               "GO:0005622", "GO:0009282")
    > goSlim(GOCollection(my_GOIds), oboSlim, "MF")
               Count   Percent Term
    GO:0000166     0  0.000000 nucleotide binding
    GO:0003674     6 40.000000 molecular_function
    GO:0003676     1  6.666667 nucleic acid binding
    GO:0003677     1  6.666667 DNA binding
    GO:0003682     0  0.000000 chromatin binding
    GO:0003700     0  0.000000 transcription factor activity
    GO:0003723     0  0.000000 RNA binding
    GO:0003774     0  0.000000 motor activity
    GO:0003824     3 20.000000 catalytic activity
    GO:0004518     0  0.000000 nuclease activity
    GO:0004871     0  0.000000 signal transducer activity
    GO:0004872     0  0.000000 receptor activity
    GO:0005102     0  0.000000 receptor binding
    GO:0005198     0  0.000000 structural molecule activity
    GO:0005215     0  0.000000 transporter activity
    GO:0005488     2 13.333333 binding
    GO:0005515     0  0.000000 protein binding
    GO:0008135     0  0.000000 translation factor activity, nuclei...
    GO:0008289     0  0.000000 lipid binding
    GO:0016301     0  0.000000 kinase activity
    GO:0016740     1  6.666667 transferase activity
    GO:0016787     0  0.000000 hydrolase activity
    GO:0019825     0  0.000000 oxygen binding
    GO:0030234     0  0.000000 enzyme regulator activity
    GO:0030246     0  0.000000 carbohydrate binding
    GO:0030528     1  6.666667 transcription regulator activity
    GO:0045182     0  0.000000 translation regulator activity

This is just a data frame, with rows being slim ids. More fun (and needing some more thought, see below) is asking about the GO categories implied by an expression set, e.g.,

    > data(sample.ExpressionSet)
    > goSlim(sample.ExpressionSet, oboSlim, "MF")
               Count    Percent Term
    GO:0000166     5  0.6157635 nucleotide binding
    GO:0003674   283 34.8522167 molecular_function
    GO:0003676    17  2.0935961 nucleic acid binding
    GO:0003677    10  1.2315271 DNA binding
    GO:0003682     2  0.2463054 chromatin binding
    GO:0003700     1  0.1231527 transcription factor activity
    GO:0003723     5  0.6157635 RNA binding
    GO:0003774     2  0.2463054 motor activity
    GO:0003824    82 10.0985222 catalytic activity
    GO:0004518     5  0.6157635 nuclease activity
    GO:0004871    36  4.4334975 signal transducer activity
    GO:0004872    29  3.5714286 receptor activity
    GO:0005102    15  1.8472906 receptor binding
    GO:0005198     7  0.8620690 structural molecule activity
    GO:0005215    44  5.4187192 transporter activity
    GO:0005488   112 13.7931034 binding
    GO:0005515    49  6.0344828 protein binding
    GO:0008135     1  0.1231527 translation factor activity, nuclei...
    GO:0008289     4  0.4926108 lipid binding
    GO:0016301    13  1.6009852 kinase activity
    GO:0016740    23  2.8325123 transferase activity
    GO:0016787    40  4.9261084 hydrolase activity
    GO:0019825     1  0.1231527 oxygen binding
    GO:0030234    14  1.7241379 enzyme regulator activity
    GO:0030246     3  0.3694581 carbohydrate binding
    GO:0030528     8  0.9852217 transcription regulator activity
    GO:0045182     1  0.1231527 translation regulator activity

I'm not sure I've made the right decisions about duplicated GO terms, and would appreciate any feedback. Here's one example:

    > goSlim(GOCollection("GO:0000016"), oboSlim, "MF")
               Count  Percent Term
    GO:0000166     0  0.00000 nucleotide binding
    GO:0003674     1 33.33333 molecular_function
    GO:0003676     0  0.00000 nucleic acid binding
    GO:0003677     0  0.00000 DNA binding
    GO:0003682     0  0.00000 chromatin binding
    GO:0003700     0  0.00000 transcription factor activity
    GO:0003723     0  0.00000 RNA binding
    GO:0003774     0  0.00000 motor activity
    GO:0003824     1 33.33333 catalytic activity
    GO:0004518     0  0.00000 nuclease activity
    GO:0004871     0  0.00000 signal transducer activity
    GO:0004872     0  0.00000 receptor activity
    GO:0005102     0  0.00000 receptor binding
    GO:0005198     0  0.00000 structural molecule activity
    GO:0005215     0  0.00000 transporter activity
    GO:0005488     0  0.00000 binding
    GO:0005515     0  0.00000 protein binding
    GO:0008135     0  0.00000 translation factor activity, nuclei...
    GO:0008289     0  0.00000 lipid binding
    GO:0016301     0  0.00000 kinase activity
    GO:0016740     0  0.00000 transferase activity
    GO:0016787     1 33.33333 hydrolase activity
    GO:0019825     0  0.00000 oxygen binding
    GO:0030234     0  0.00000 enzyme regulator activity
    GO:0030246     0  0.00000 carbohydrate binding
    GO:0030528     0  0.00000 transcription regulator activity
    GO:0045182     0  0.00000 translation regulator activity

I'm also intending to enable expression set subsetting based on GOCollections (you can currently do this with GeneSets defined in GSEABase).

Here are a couple of issues.

The hierarchies in oboSlim overlap, so GO:0000016 gets classified three different ways. I think this is one of the problems Thomas alluded to.

A more complicated example is going from an ExpressionSet. Each probe might map to several different GO terms (each of which can be classified to several different slim terms).

Several features in an ExpressionSet can map to the same GO term. Currently, I make the GO terms unique, so this

    > goSlim(GOCollection(c("GO:0000016", "GO:0000016")), oboSlim, "MF")

produces the same result as above. The result of goSlim is the classification of all unique terms implied by the feature names, unweighted by the frequency of the term implied by the feature names. Again, I'm not sure that this is the most helpful; certainly it makes any kind of statistical assessment difficult.

For those wanting to use this kind of functionality without heading into the devel branch immediately, here's some code and how to use it:

    ## Group GO ids into GO_ontology-specific GO_slim categories
    GO_slim <- function(ids, GO_slim, GO_ontology="MF", verbose=FALSE) {
        require("AnnotationDbi")
        ## GO annotation data package (named GO.db in later Bioconductor releases)
        require("GO")

        ## Get GO_slim terms, restricted to GO_ontology
        terms <- mget(GO_slim, GOTERM, ifnotfound=NA)
        if (any(is.na(terms))) {
            if (verbose)
                warning("GO_slim ids not found: ",
                        paste(names(terms)[is.na(terms)], collapse=" "))
            terms <- terms[!is.na(terms)]
        }
        terms <- terms[sapply(terms, Ontology)==GO_ontology]
        GO_slim <- names(terms)

        ## Use GO_ontology to find the required offspring
        OFFSPRING <- switch(GO_ontology,
                            MF=GOMFOFFSPRING,
                            BP=GOBPOFFSPRING,
                            CC=GOCCOFFSPRING,
                            stop("GO_ontology must be 'MF', 'BP', or 'CC'"))
        ## Get the offspring of GO_slim
        slim <- mget(GO_slim, OFFSPRING, ifnotfound=NA)
        slim <- slim[!is.na(slim)]
        ## Reverse the relationship: 'offspring' become keys, 'parents'
        ## become values. Select the sampled offspring
        ids <- unique(ids)
        samp <- revmap(slim)[ids]
        samp <- samp[!sapply(samp, is.null)]
        ## Count occurrences of each slim
        cnt <- table(unlist(samp))
        ## Adjust for sample ids matching slim ids
        idx <- table(ids[which(ids %in% names(slim))])
        idx_n <- names(idx)
        cnt[idx_n] <- idx + ifelse(is.na(cnt[idx_n]), 0, cnt[idx_n])

        ## Prepare a data frame for results
        df <- data.frame(Slim=names(terms),
                         Count=0L, Percent=0,
                         Term=sprintf("%.35s%s",
                           sapply(terms, Term, USE.NAMES=FALSE),
                           ifelse(nchar(sapply(terms, Term))>35, "...", "")),
                         row.names=1)
        ## add our counts
        df[names(cnt),c("Count", "Percent")] <- c(cnt, 100*cnt/sum(cnt))
        df[order(row.names(df)),]
    }

    ## Read 'obo' files for GO ids, from the web or disk
    GO_oboIds <- function(src) {
        ## Parse OBO into 'stanza' and 'kv' (key-value) tables.
        ## VERY NAIVE
        data <- readLines(src)
        parser <- list(stanza="^\\[(.*)\\]", kv="^([^:]*):\\s*(.*)")
        stanza <- data.frame(id=c(0,grep(parser$stanza, data)),
                             value=c("Root", sub(parser$stanza, "\\1",
                               grep(parser$stanza, data, value=TRUE))),
                             stringsAsFactors=FALSE)
        kv_pairs <- grep(parser$kv, data, value=TRUE)
        kv_id <- grep(parser$kv, data)
        ## For each key-value line, find the stanza header that precedes it
        stanza_id <- sapply(kv_id, function(x) {
            idx <- x > stanza$id
            stanza$id[xor(idx, c(idx[-1], FALSE))]
        })
        kv <- data.frame(id=kv_id, stanza_id=stanza_id,
                         key=sub(parser$kv, "\\1", kv_pairs),
                         value=sub(parser$kv, "\\2", kv_pairs),
                         stringsAsFactors=FALSE)

        ## Get GO ids: 'id' keys that belong to [Term] stanzas
        merge(kv[kv$key=="id", names(kv)!="key", drop=FALSE],
              stanza[stanza$value=="Term", names(stanza)!="value",
                     drop=FALSE],
              by.x="stanza_id", by.y="id")$value
    }

    > ## Thomas Girke's original (Sept 2005) slim ids, from
    > ## http://www.geneontology.org/GO.slims.shtml
    > slimIds <- c("GO:0030246", "GO:0008289", "GO:0003676", "GO:0000166",
    +              "GO:0019825", "GO:0005515", "GO:0003824", "GO:0030234",
    +              "GO:0005554", "GO:0003774", "GO:0004871", "GO:0005198",
    +              "GO:0030528", "GO:0045182", "GO:0005215", "GO:0000004",
    +              "GO:0006519", "GO:0007154", "GO:0007049", "GO:0016043",
    +              "GO:0006412", "GO:0006464", "GO:0006810", "GO:0007275",
    +              "GO:0007049", "GO:0006519", "GO:0005975", "GO:0006629",
    +              "GO:0006139", "GO:0019748", "GO:0015979", "GO:0005618",
    +              "GO:0005829", "GO:0005783", "GO:0005768", "GO:0005794",
    +              "GO:0005739", "GO:0005777", "GO:0009536", "GO:0005840",
    +              "GO:0005773", "GO:0005764", "GO:0005856", "GO:0005634",
    +              "GO:0005886", "GO:0008372", "GO:0005576")

    > ## Some GO ids
    > my_GOIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0004345",
    +               "GO:0004345", "GO:0004345", "GO:0004345", "GO:0008265",
    +               "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664",
    +               "GO:0006412", "GO:0006412", "GO:0006412", "GO:0007046",
    +               "GO:0015979", "GO:0006457", "GO:0008372", "GO:0005618",
    +               "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311",
    +               "GO:0005622", "GO:0009282")

    > ## classify GOIds into slimIds for the "MF" ontology
    > res <- GO_slim(my_GOIds, slimIds, "MF")
    > idx <- res$Count>0
    > pie(res[idx, "Count"], row.names(res)[idx])

    > require("Biobase")   # provides sample.ExpressionSet and subListExtract
    > data(sample.ExpressionSet)
    > require(annotation(sample.ExpressionSet), character.only=TRUE)
    > evidenceCode <- "TAS" # restrict to particular evidence codes
    > res <- mget(featureNames(sample.ExpressionSet), hgu95av2GO)
    > res <- res[!is.na(res)]
    > gids <- unlist(lapply(res, subListExtract, "GOID"))
    > ecode <- unlist(lapply(res, subListExtract, "Evidence"))
    > ugids <- unique(gids[ecode %in% evidenceCode])
    > ## 'src' was not defined in the original message; the plant slim
    > ## file used for 'fl' above is assumed here
    > src <- "http://www.geneontology.org/GO_slims/goslim_plant.obo"
    > df <- GO_slim(ugids, GO_oboIds(src), "MF")

Martin
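When only the term ids are needed from an OBO file, the stanza bookkeeping can be sketched more compactly. This is an alternative illustration and not the code posted above: `obo_term_ids()` is a hypothetical helper, the inline text is a hand-written fragment in OBO layout, and the sketch assumes every `[Term]` stanza carries exactly one `id:` line.

```r
## Hand-written OBO fragment for illustration (not a real slim file)
obo_text <- c(
    "format-version: 1.2",
    "",
    "[Term]",
    "id: GO:0000166",
    "name: nucleotide binding",
    "namespace: molecular_function",
    "",
    "[Typedef]",
    "id: part_of",
    "name: part of",
    "",
    "[Term]",
    "id: GO:0003677",
    "name: DNA binding",
    "namespace: molecular_function"
)

## Hypothetical helper: return the ids found inside [Term] stanzas
obo_term_ids <- function(lines) {
    ## A stanza starts at each "[...]" header line
    is_header <- grepl("^\\[.*\\][[:space:]]*$", lines)
    ## Running stanza number per line; 0 means the file header
    stanza_no <- cumsum(is_header)
    ## Stanza type per stanza number ("" for the file header)
    types <- c("", sub("^\\[(.*)\\][[:space:]]*$", "\\1", lines[is_header]))
    stanza_type <- types[stanza_no + 1]
    ## Keep 'id:' lines that sit inside [Term] stanzas only
    keep <- stanza_type == "Term" & grepl("^id:", lines)
    sub("^id:[[:space:]]*", "", lines[keep])
}

ids <- obo_term_ids(obo_text)
print(ids)
```

With a real file the same helper would be called as `obo_term_ids(readLines(src))`, where `src` is a URL or a local path; the `[Typedef]` id is skipped because its stanza type is not "Term".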

ADD COMMENT • link 16.5 years ago Martin Morgan 25k

Hi Martin, This preliminary implementation of a GO_Slim function in GSEABase looks great to me. I particularly like the option to support the usage of custom GO_Slim sets. One suggestion regarding the removal of duplicates: I suggest to make their removal optional on all different levels (GO-to-Slim and gene-to-GO), and set the default to no duplicate removal on any level. This, I think, is the most common way these analyses are performed or how most users would like to perform them. But I might we wrong here? Best, Thomas On Fri 10/12/07 13:32, Martin Morgan wrote: > Hi Thomas -- > > I like the idea of incorporating slims into an existing package (and > have started to in the development version of GSEABase, version 1.1.5, > when it appears). I've been playing around a bit and wanted to get > some feedback from you and the list. > > There's an immediately usable version of the underlying code at the > end of the message (no devel version of GSEABase required). The code > below is considerably faster than that posted earlier. > > In GSEABase, the function getOBOCollection parses 'obo' files, as > these seem to be how 'slim' sets are sometimes stored. With GSEABase, > you'd > > > fl <- "http://www.geneontology.org/GO_slims/goslim_plant.obo" > > oboSlim <- getOBOCollection(fl) > > oboSlim > collectionType: GO > ids: GO:0000003, GO:0000166, ..., GO:0045182 (106 total) > evidenceCode: IMP IPI TAS ISS IDA NAS IEA IGI RCA IEP IC NR ND > > 'fl' could be a local file. 
You could also create an ad hoc collection > with > > > slimTerms <- GOCollection(c("GO:0030246", "GO:0008289", "GO:0003676", > + "GO:0000166")) > > Use the collection of slim ids to classify your own ids, > e.g., > > > my_GOIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0004345", > + "GO:0004345", "GO:0004345", "GO:0004345", "GO:0008265", > + "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664", > + "GO:0006412", "GO:0006412", "GO:0006412", "GO:0007046", > + "GO:0015979", "GO:0006457", "GO:0008372", "GO:0005618", > + "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311", > + "GO:0005622", "GO:0009282") > > goSlim(GOCollection(my_GOIds), oboSlim, "MF") > Count Percent Term > GO:0000166 0 0.000000 nucleotide binding > GO:0003674 6 40.000000 molecular_function > GO:0003676 1 6.666667 nucleic acid binding > GO:0003677 1 6.666667 DNA binding > GO:0003682 0 0.000000 chromatin binding > GO:0003700 0 0.000000 transcription factor activity > GO:0003723 0 0.000000 RNA binding > GO:0003774 0 0.000000 motor activity > GO:0003824 3 20.000000 catalytic activity > GO:0004518 0 0.000000 nuclease activity > GO:0004871 0 0.000000 signal transducer activity > GO:0004872 0 0.000000 receptor activity > GO:0005102 0 0.000000 receptor binding > GO:0005198 0 0.000000 structural molecule activity > GO:0005215 0 0.000000 transporter activity > GO:0005488 2 13.333333 binding > GO:0005515 0 0.000000 protein binding > GO:0008135 0 0.000000 translation factor activity, nuclei... > GO:0008289 0 0.000000 lipid binding > GO:0016301 0 0.000000 kinase activity > GO:0016740 1 6.666667 transferase activity > GO:0016787 0 0.000000 hydrolase activity > GO:0019825 0 0.000000 oxygen binding > GO:0030234 0 0.000000 enzyme regulator activity > GO:0030246 0 0.000000 carbohydrate binding > GO:0030528 1 6.666667 transcription regulator activity > GO:0045182 0 0.000000 translation regulator activity > > This is just a data frame, with rows being slim ids. 
More fun (and > needing some more thought -- see below) is asking about the GO > categories implied by an expression set, e.g., > > > data(sample.ExpressionSet) > > goSlim(sample.ExpressionSet, oboSlim, "MF") > Count Percent Term > GO:0000166 5 0.6157635 nucleotide binding > GO:0003674 283 34.8522167 molecular_function > GO:0003676 17 2.0935961 nucleic acid binding > GO:0003677 10 1.2315271 DNA binding > GO:0003682 2 0.2463054 chromatin binding > GO:0003700 1 0.1231527 transcription factor activity > GO:0003723 5 0.6157635 RNA binding > GO:0003774 2 0.2463054 motor activity > GO:0003824 82 10.0985222 catalytic activity > GO:0004518 5 0.6157635 nuclease activity > GO:0004871 36 4.4334975 signal transducer activity > GO:0004872 29 3.5714286 receptor activity > GO:0005102 15 1.8472906 receptor binding > GO:0005198 7 0.8620690 structural molecule activity > GO:0005215 44 5.4187192 transporter activity > GO:0005488 112 13.7931034 binding > GO:0005515 49 6.0344828 protein binding > GO:0008135 1 0.1231527 translation factor activity, nuclei... > GO:0008289 4 0.4926108 lipid binding > GO:0016301 13 1.6009852 kinase activity > GO:0016740 23 2.8325123 transferase activity > GO:0016787 40 4.9261084 hydrolase activity > GO:0019825 1 0.1231527 oxygen binding > GO:0030234 14 1.7241379 enzyme regulator activity > GO:0030246 3 0.3694581 carbohydrate binding > GO:0030528 8 0.9852217 transcription regulator activity > GO:0045182 1 0.1231527 translation regulator activity > > I'm not sure I've made the right decisions about duplicated GO terms, > and would appreciate any feedback. 
Here's one example: > > > goSlim(GOCollection("GO:0000016"), oboSlim, "MF") > Count Percent Term > GO:0000166 0 0.00000 nucleotide binding > GO:0003674 1 33.33333 molecular_function > GO:0003676 0 0.00000 nucleic acid binding > GO:0003677 0 0.00000 DNA binding > GO:0003682 0 0.00000 chromatin binding > GO:0003700 0 0.00000 transcription factor activity > GO:0003723 0 0.00000 RNA binding > GO:0003774 0 0.00000 motor activity > GO:0003824 1 33.33333 catalytic activity > GO:0004518 0 0.00000 nuclease activity > GO:0004871 0 0.00000 signal transducer activity > GO:0004872 0 0.00000 receptor activity > GO:0005102 0 0.00000 receptor binding > GO:0005198 0 0.00000 structural molecule activity > GO:0005215 0 0.00000 transporter activity > GO:0005488 0 0.00000 binding > GO:0005515 0 0.00000 protein binding > GO:0008135 0 0.00000 translation factor activity, nuclei... > GO:0008289 0 0.00000 lipid binding > GO:0016301 0 0.00000 kinase activity > GO:0016740 0 0.00000 transferase activity > GO:0016787 1 33.33333 hydrolase activity > GO:0019825 0 0.00000 oxygen binding > GO:0030234 0 0.00000 enzyme regulator activity > GO:0030246 0 0.00000 carbohydrate binding > GO:0030528 0 0.00000 transcription regulator activity > GO:0045182 0 0.00000 translation regulator activity > > I'm also intending to enable expression set subsetting based on > GOCollections (you can currently do this with GeneSets defined in > GSEABase). > > Here are a couple of issues. > > The hierarchies in oboSlim overlap, so GO:0000016 gets classified > three different ways. I think this is one of the problems Thomas > alluded to. > > A more complicated example is going from ExpressionSet. Each probe > might map to several different GO terms (each of which can be > classified to several different slim terms). > > Several features in an ExpressionSet can map to the same GO > term. 
Currently, I make the GO terms unique, so this > > > goSlim(GOCollection(c("GO:0000016", "GO:0000016")), oboSlim, "MF") > > produces the same result as above. The result of goSlim is the > classification of all unique terms implied by the feature names, > unweighted by the frequency of the term implied by the feature > names. Again I'm not sure that this is the most helpful; certainly it > makes any kind of statistical assessment difficult. > > For those wanting to use this kind of functionality without heading > into the devel branch immediately, here's some code and how to use it: > > ## Group GO ids into GO_ontology-specific GO_slim categories > GO_slim <- function(ids, GO_slim, GO_ontology="MF", verbose=FALSE) { > require("AnnotationDbi") > require("GO") > > ## Get GO_slim terms, restricted to GO_ontology > terms <- mget(GO_slim, GOTERM, ifnotfound=NA) > if (anyis.na(terms))) { > if (verbose) > warning("GO_slim ids not found: ", > paste(names(terms)[is.na(terms)], collapse=" ")) > terms <- terms[!is.na(terms)] > } > terms <- terms[sapply(terms, Ontology)==GO_ontology] > GO_slim <- names(terms) > > ## Use GO_ontology to find the required offspring > OFFSPRING <- switch(GO_ontology, > MF=GOMFOFFSPRING, > BP=GOBPOFFSPRING, > CC=GOCCOFFSPRING, > stop("GO_ontology must be 'MF', 'BP', or 'CC'")) > ## Get the offspring of GO_slim > slim <- mget(GO_slim, OFFSPRING, ifnotfound=NA) > slim <- slim[!is.na(slim)] > ## Reverse the relationship: 'offspring' become keys, 'parents' > ## become values. 
Select the sampled offspring > ids <- unique(ids) > samp <- revmap(slim)[ids] > samp <- samp[!sapply(samp, is.null)] > ## Count occurences of each slim > cnt <- table(unlist(samp)) > ## Adjust for sample ids matching slim ids > idx <- table(ids[which(ids %in% names(slim))]) > idx_n<- names(idx) > cnt[idx_n] <- idx + ifelseis.na(cnt[idx_n]), 0, cnt[idx_n]) > > ## Prepare a data frame for results > df <- data.frame(Slim=names(terms), > Count=0L, Percent=0, > Term=sprintf("%.35s%s", > sapply(terms, Term, USE.NAMES=FALSE), > ifelse(nchar(sapply(terms, Term))>35, "...", "")), > row.names=1) > ## add our counts > df[names(cnt),c("Count", "Percent")] <- c(cnt, 100*cnt/sum(cnt)) > df[order(row.names(df)),] > } > > ## Read 'obo' files for GO ids, from the web or disk > GO_oboIds <- function(src) { > ## Parse OBO into 'stanza' and 'kv' (key-value) tables. > ## VERY NAIVE > data <- readLines(src) > parser <- list(stanza="^\\[(.*)\\]", kv="^([^:]*):\\s*(.*)") > stanza <- data.frame(id=c(0,grep(parser$stanza, data)), > value=c("Root", sub(parser$stanza, "\\1", > grep(parser$stanza, data, value=TRUE))), > stringsAsFactors=FALSE) > kv_pairs <- grep(parser$kv, data, value=TRUE) > kv_id <- grep(parser$kv, data) > stanza_id <- sapply(kv_id, function(x) { > idx <- x > stanza$id > stanza$id[xor(idx, c(idx[-1], FALSE))] > }) > kv <- data.frame(id=kv_id, stanza_id=stanza_id, > key=sub(parser$kv, "\\1", kv_pairs), > value=sub(parser$kv, "\\2", kv_pairs), > stringsAsFactors=FALSE) > > ## Get GO ids > merge(kv[kv$key=="id", names(kv)!="key", drop=FALSE], > stanza[stanza$value=="Term", names(stanza)!="value", > drop=FALSE], > by.x="stanza_id", by.y="id")$value > } > > > ## Thomas Girke's original (Sept 2005) slim ids, from > > ## http://www.geneontology.org/GO.slims.shtml > > slimIds <- c("GO:0030246", "GO:0008289", "GO:0003676", "GO:0000166", > > "GO:0019825", "GO:0005515", "GO:0003824", "GO:0030234", > > "GO:0005554", "GO:0003774", "GO:0004871", "GO:0005198", > > "GO:0030528", 
"GO:0045182", "GO:0005215", "GO:0000004", > > "GO:0006519", "GO:0007154", "GO:0007049", "GO:0016043", > > "GO:0006412", "GO:0006464", "GO:0006810", "GO:0007275", > > "GO:0007049", "GO:0006519", "GO:0005975", "GO:0006629", > > "GO:0006139", "GO:0019748", "GO:0015979", "GO:0005618", > > "GO:0005829", "GO:0005783", "GO:0005768", "GO:0005794", > > "GO:0005739", "GO:0005777", "GO:0009536", "GO:0005840", > > "GO:0005773", "GO:0005764", "GO:0005856", "GO:0005634", > > "GO:0005886", "GO:0008372", "GO:0005576") > > > ## Some GO ids > > my_GOIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0004345", > > "GO:0004345", "GO:0004345", "GO:0004345", "GO:0008265", > > "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664", > > "GO:0006412", "GO:0006412", "GO:0006412", "GO:0007046", > > "GO:0015979", "GO:0006457", "GO:0008372", "GO:0005618", > > "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311", > > "GO:0005622", "GO:0009282") > > > ## classify GOIds into slimIds for the "MF" ontology > > res <- GO_slim(my_GOIds, slimIds, "MF") > > idx <- res$Count>0 > > pie(res[idx, "Count"], row.names(res)[idx]) > > > data(sample.ExpressionSet) > > require(annotation(sample.ExpressionSet)) > > evidenceCode <- "TAS" # restict to particular evidence codes > > res <- mget(featureNames(sample.ExpressionSet), hgu95av2GO) > > res <- res[!is.na(res)] > > gids <- unlist(lapply(res, subListExtract, "GOID")) > > ecode <- unlist(lapply(res, subListExtract, "Evidence")) > > ugids <- unique(gids[ecode %in% evidenceCode]) > > df <- GO_slim(ugids, GO_oboIds(src), "MF") > > Martin > > Thomas Girke <thomas.girke at="" ucr.edu=""> writes: > > > GO_Slim categories are usually sets of very general custom GO > > terms. 
> > Examples of these GO_Slim sets can be downloaded from
> > geneontology.org:
> >
> > http://www.geneontology.org/GO.slims.shtml
> >
> > When working with GOstats, you can simply use your favorite GO_Slim
> > set for subsetting your enrichment analysis results and then plot
> > the corresponding counts. Usually, GO_Slim representations,
> > especially pie charts, pretend that the different items (e.g. genes)
> > are assigned to only one category - which is typically not the case
> > - since there are duplicates almost everywhere.
> >
> > As a suggestion to the developers:
> > Considering the popularity of these GO_Slim representations, it
> > might be useful to add some instructions in the GOstats PDF that
> > illustrate to users how to generate counts and plots for their
> > favorite GO_Slim categories? Perhaps with a proper warning about the
> > limitations of these analyses.
> >
> > Best,
> >
> > Thomas
> >
> >
> > On Wed 10/10/07 11:23, Celine Carret wrote:
> >> Hi,
> >> you should have a look at this:
> >> http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/goSlim.txt
> >>
> >> I found it very clear, and useful, and if you don't work on human-chips
> >> you can customise the script easily.
> >>
> >> Best regards
> >> Celine
> >>
> >> -----Original Message-----
> >> From: bioconductor-bounces at stat.math.ethz.ch
> >> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Ganiraju
> >> Sent: 09 October 2007 20:37
> >> To: bioconductor at stat.math.ethz.ch
> >> Subject: [BioC] GO classification
> >>
> >> hi,
> >>
> >> I got a set of significant GOids by running my data using GOhypergtest
> >> of
> >> Gostats. Now im trying to classify these GO ids using the GO_slim
> >> classifier. Is there any package in R which can accomplish this job??
> >>
> >> Thanks
> >> Gani
> >
> > --
> > Thomas Girke
> > Assistant Professor of Bioinformatics
> > Director, IIGB Bioinformatic Facility
> > Center for Plant Cell Biology (CEPCEB)
> > Institute for Integrative Genome Biology (IIGB)
> > Department of Botany and Plant Sciences
> > 1008 Noel T. Keen Hall
> > University of California
> > Riverside, CA 92521
> >
> > E-mail: thomas.girke at ucr.edu
> > Website: http://faculty.ucr.edu/~tgirke
> > Ph: 951-827-2469
> > Fax: 951-827-4437
>
> --
> Martin Morgan
> Computational Biology Shared Resource Director
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M2 B169
> Phone: (208) 667-2793

--
Thomas Girke
Assistant Professor of Bioinformatics
Director, IIGB Bioinformatic Facility
Center for Plant Cell Biology (CEPCEB)
Institute for Integrative Genome Biology (IIGB)
Department of Botany and Plant Sciences
1008 Noel T. Keen Hall
University of California
Riverside, CA 92521

E-mail: thomas.girke at ucr.edu
Website: http://faculty.ucr.edu/~tgirke
Ph: 951-827-2469
Fax: 951-827-4437
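
The caveat quoted above, that GO_Slim pie charts pretend each item belongs to only one category, can be seen on a toy example. What follows is only a minimal base-R sketch under stated assumptions: the `T:` ids and the `parents` map are made up for illustration, and `ancestors()` and `slim_counts()` are hypothetical helpers, not functions from GOstats or GO.db. It also walks upward from each id to its ancestors instead of using the offspring maps that the quoted `GO_slim` code relies on, which should give the same counts on a small hierarchy.

```r
## Toy child -> parents map; the ids are invented for illustration only
parents <- list(
    "T:root" = character(0),
    "T:A"    = "T:root",
    "T:B"    = "T:root",
    "T:A1"   = "T:A",
    "T:AB"   = c("T:A", "T:B"),   # a term with two parents
    "T:B1"   = "T:B"
)

## All ancestors of one id, found by following parent links upward
ancestors <- function(id, parents) {
    seen <- character(0)
    todo <- parents[[id]]
    while (length(todo) > 0) {
        cur <- todo[1]
        todo <- todo[-1]
        if (!(cur %in% seen)) {
            seen <- c(seen, cur)
            todo <- c(todo, parents[[cur]])
        }
    }
    seen
}

## For each slim id, count the sample ids that equal it or descend from it
slim_counts <- function(ids, slim, parents) {
    ids <- unique(ids)
    hits <- lapply(ids, function(id) {
        intersect(slim, c(id, ancestors(id, parents)))
    })
    cnt <- as.integer(table(factor(unlist(hits), levels = slim)))
    data.frame(Slim = slim, Count = cnt,
               Percent = 100 * cnt / max(sum(cnt), 1),
               stringsAsFactors = FALSE)
}

my_ids <- c("T:A1", "T:AB", "T:B1", "T:AB")
res <- slim_counts(my_ids, c("T:A", "T:B"), parents)
print(res)
```

Three distinct ids go in, but the counts add up to four because `T:AB` sits under both slim terms. A pie chart drawn from `res$Count` would therefore show slices of assignments, not of ids, which is the limitation worth stating next to any such plot.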

ADD REPLY • link 16.5 years ago Thomas Girke ★ 1.7k
