Question

GO with R and Bioconductor

0

Entering edit mode

Duke ▴ 210

@duke-4050

Last seen 9.6 years ago

On 2/28/11 2:31 PM, Vincent Carey wrote: > On Mon, Feb 28, 2011 at 2:07 PM, Duke<duke.lists at="" gmx.com=""> wrote: >> On 2/26/11 8:09 PM, Vincent Carey wrote: >>> On Sat, Feb 26, 2011 at 6:40 PM, Duke<duke.lists at="" gmx.com=""> wrote: >>>> Hi Vincent, >>>> >>>> On 2/26/11 2:38 PM, Vincent Carey wrote: >>>>> On Sat, Feb 26, 2011 at 12:00 PM,<duke.lists at="" gmx.com=""> wrote: >>>>>> Dear colleagues, >>>>>> >>>>>> I used to download GO database from geneoncology.org and did some c++ >>>>>> coding to manipulate the data as I wished. Now I want to try my luck >>>>>> with R >>>>>> - Bioconductor. I have heard of tons of tools supporting GO such as >>>>>> GO.db, >>>>>> topGO, goseq, GOstats, biomart etc... and I have been reading their >>>>>> description and examples, but honestly I am overhelmed and dont really >>>>>> know >>>>>> which package I should use to fulfill my task. So please advise me how >>>>>> I can >>>>>> do the following two simple tasks: >>>>>> >>>>>> 1. I have a list of genes (with gene names from UCSC such as Foxp3 >>>>>> etc...). How do I filter this list to get genes that have certain GO >>>>>> term >>>>>> such as transcription factor? >>>>> since you said it was a simple task, consider the simple solution >>>>> involving the "%annTo%" operator, which tells whether the symbols on >>>>> the left have been annotated to the term on the right: >>>>> >>>>>> c("FOXP3", "BRCA2") %annTo% "mammary" >>>>> FOXP3 BRCA2 >>>>> FALSE TRUE >>>>>> c("FOXP3", "BRCA2") %annTo% "transcription factor" >>>>> FOXP3 BRCA2 >>>>> TRUE FALSE >>>>> >>>>> you could use the named logical vectors generated in this way to >>>>> perform the filtering you describe. but see below. >>>>> >>>>>> 2. How do I know the capacity of the latest GO database on >>>>>> bioconductor, >>>>>> for example, how many genes available for mm9, and how many of them >>>>>> have GO >>>>>> term transcription factor? >>>>> The "GO database" concerns the gene ontology, a structure of terms and >>>>> relationships among them. The association of GO terms to gene names >>>>> for mouse is presented in various ways, but the most basic one is in >>>>> the org.Mm.eg.db package. With that, you could use >>>>> >>>>> org.Mm.eg() >>>>> >>>>> to find, among other statistics, >>>>> >>>>> org.Mm.egGO has 29984 mapped keys (of 63329 keys). Your question >>>>> concerning transcription factor mapping is not completely precise, and >>>>> you might want to survey the family of GO terms to come up with a set >>>>> of terms that meets your requirement. >>>> You are right, I did not make it clear. What I wanted was exactly what >>>> you >>>> said: I need to get a list of genes that have a certain set of terms (for >>>> example *transcription*) - which corresponds to a family of GO terms, not >>>> just one GO term. >>>> >>>>> Here's a >>>>> demonstration of related queries: >>>>> >>>>>> get("GO:0003700", GOTERM) >>>>> GOID: GO:0003700 >>>>> Term: sequence-specific DNA binding transcription factor activity >>>>> Ontology: MF >>>>> Definition: Interacting selectively and non-covalently with a specific >>>>> DNA sequence in order to modulate transcription. The transcription >>>>> factor may or may not also interact selectively with a protein or >>>>> macromolecular complex. >>>>> Synonym: GO:0000130 >>>>> Secondary: GO:0000130 >>>>>> tfg = get("GO:0003700", revmap(org.Mm.egGO)) >>>>>> length(tfg) >>>>> [1] 940 >>>> This works fine in case of one GO term (GO: 0003700). Is there a similar >>>> function like getterm("transcription", revmap()) to get all the genes >>>> that >>>> their GO terms contain *transcription*? >>> As far as I know there is no such function. Solutions depend on the >>> type of wild-card you are looking for. If SQL LIKE operator is >>> adequate >>> >>> termlikeToTags = function (liketok) >>> { >>> require(GO.db) >>> gcon = GO_dbconn() >>> as.character(dbGetQuery(gcon, paste("select go_id from go_term >>> where term like '%", >>> liketok, "%'", sep = ""))[[1]]) >>> } >>> >>> if you want to use full regular expressions >>> >>> reToTags = function (retok, ...) { >>> require(GO.db) >>> allt = dbGetQuery(GO_dbconn(), "select go_id, term from go_term") >>> inds = grep(retok, as.character(allt[,"term"]), value=FALSE, ...) >>> # will error if value set at call >>> if (length(inds)>0) return(as.character(allt[inds,"go_id"])) >>> stop("retok did not grep to any term") >>> } >>> >>>> length(reToTags("transcription")) >>> [1] 372 >>>> length(termlikeToTags("transcription")) >>> [1] 372 >>>> length(reToTags("transcrip.*")) >>> [1] 411 >>>> length(termlikeToTags("transcrip.*")) >>> [1] 0 >>> >> As far as I understand, these functions will list out all the GO Terms >> containing "transcription" for example. In order to search for all of the >> genes in the database that have these GO Terms, I suppose I will have to >> loop >> >> tfg = get("GO:0003700", revmap(org.Mm.egGO)) >> length(tfg) >> >> for each of the term and then sum them up? Since your %annTo% function is to >> check a certain gene list to see if any of them has the specified term, >> would it be better to just run that function to the available genes in the >> database? How do I list and use the gene list in the database? > I'd suggest you read the vignettes at > http://www.bioconductor.org/help/bioc- views/release/bioc/html/AnnotationDbi.html, > particularly http://www.bioconductor.org/packages/2.7/bioc/vignettes /AnnotationDbi/inst/doc/AnnotationDbi.pdf > > The codes I provided are hints in the direction of solutions for the > family of questions you pose. > >>>>> org.Mm.egGO is a mapping from mouse entrez gene ids to GO term tags. >>>>> revmap reverses this mapping and takes a tag to the set of genes >>>>> mapped to the tag by entrez. >>>>> >>>>> Now, to return to the first question -- it isn't simple and a lot of >>>>> presuppositions have to be made explicit. One of the most problematic >>>>> is the commitment to use gene symbols. If you don't read the docs >>>>> about bioconductor annotation and R packages pertaining thereto, it's >>>>> hard to make progress. >>>> Can you elaborate a little bit more why using gene symbols is >>>> problematic? >>>> And if it is really a problem, what gene ID I should use to avoid that >>>> problem? >>> Entrez gene IDs are generally more specific. If you haven't run into >>> collisions and ambiguities with symbols by now, maybe you don't have >>> to worry about it. >>> >> Well, our reference database is UCSC refFlat downloaded from UCSC genome >> browser, and in that format there is only geneName (SYMBOL) and name >> (refSEQ). There is no Entrez Gene ID in that file. We actually have a >> problem of one same geneName (SYMBOL) has more than one isoforms (more than >> one refSEQ), but in fact we dont have a good way of differentiate between >> those (by mathematical methods). So we use either gene symbol or refSeq ID. >> > most of our annotation packages have mappings that employ refseq identifiers. > the functions i provided can be readily altered to use refseq ids as > keys, and if you > introduce types for the identifier/term tokens, the functions can be > implemented as methods > that perform appropriately for inputs from different vocabularies. > but you will have to > learn how to use the mapping objects or the sqlite tables > productively. Package GSEABase is > also a useful resource for working with collections of gene identifiers. > Thanks for both of your suggestion Vincent. I will definitely check them out. Your suggestions/hints do help me a lot :). Bests, D.

Transcription Annotation GO GOstats biomaRt topGO GSEABase goseq Transcription Annotation • 1.8k views

ADD COMMENT • link 13.2 years ago Duke ▴ 210