"topGOdata" object: How to supply gene scores with a predefined list of genes

0

Entering edit mode

Scott Ochsner ▴ 300

@scott-ochsner-599

Last seen 9.6 years ago

Hi, I would like to attach gene "score" info to a predefined list of interesting genes to generate a topGOdata object. The predefined list of genes was obtained by: > library(limma) > library(topGO) > input<-cbind(FC=fit$coefficients[,1],pval=p.adjust(fit$p.value[,1],met ho d="BH")) > selectFUN<-function(x){return(abs(x[,1]) >=1 & x[,2] < 0.05)} > diffgenes<-selectFUN(input) > myInterestedGenes<-names(which(diffgenes==T)) > geneNames<-rownames(input) > geneList<-factor(as.integer(geneNames %in% myInterestedGenes)) > names(geneList)<-geneNames > str(geneList) Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ... - attr(*, "names")= chr [1:34760] "10338001" "10338003" "10338004" "10338017" ... Unfortunately, the predefined list does not contain any DE "score" information. I would greatly appreciate any help in attaching the score information to a predefined list or incorporating p.value as well as fold change cutoffs into a geneSel function when creating a topGOdata object, Thanks for any help, Scott Scott A. Ochsner, PhD One Baylor Plaza BCM130, Houston, TX 77030 Voice: (713) 798-6227 Fax: (713) 790-1275 > sessionInfo() R version 2.9.0 (2009-04-17) i386-pc-mingw32 locale: LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] topGO_1.12.0 SparseM_0.80 GO.db_2.2.11 RSQLite_0.7-1 DBI_0.2-4 AnnotationDbi_1.6.1 Biobase_2.4.1 graph_1.22.2 limma_2.18.2 loaded via a namespace (and not attached): [1] grid_2.9.0 lattice_0.17-25 tools_2.9.0

GO GO • 2.1k views

ADD COMMENT • link updated 14.6 years ago by Adrian Alexa ▴ 400 • written 14.7 years ago by Scott Ochsner ▴ 300

0

Entering edit mode

Adrian Alexa ▴ 400

@adrian-alexa-936

Last seen 9.6 years ago

Hi Scott, I'm not sure I totally understand your question, but if you want to build a "topGOdata" object from a list a genes for which you have scores (quantifying differential expression) there is a simple way to do it. The first thing you need is a named numeric vector, where the gene identifiers are stored in the names attribute of the vector and the numeric values are the respective gene scores. The set of genes found in the names attribute defines the gene universe. For example, the following should work for you: geneList <- p.adjust(fit$p.value[,1],method="BH")) names(geneList) <- geneNames Then you will need to define a function for specifying the list of interesting genes based on the scores (in your case the adjusted p-values). The function must return a logical vector specifying which gene is selected and which not. The function must have one argument, named allScore and must not depend on any attributes of this object. If for example you want to select all genes with an adjusted p-value lower than 0.01, then the function should look like: topDiffGenes <- function(allScore) { return(allScore < 0.01) } Now you can can build a "topGOdata" object as follows (in the code bellow I assume you are using a Bioconductor annotation package, for example "hgu133a") ## build the topGOdata class GOdata <- new("topGOdata", ontology = "BP", allGenes = geneList, geneSel = topDiffGenes, annot = annFUN.db, affyLib = "hgu133a") ## display the GOdata object GOdata I hope this answers your question. Please let me know if you have further problems. Regards, Adrian On Tue, Aug 25, 2009 at 10:04 PM, Ochsner, Scott A<sochsner at="" bcm.tmc.edu=""> wrote: > Hi, > > I would like to attach gene "score" info to a predefined list of > interesting genes to generate a topGOdata object. ?The predefined list > of genes was obtained by: >> library(limma) >> library(topGO) >> > input<-cbind(FC=fit$coefficients[,1],pval=p.adjust(fit$p.value[,1],m etho > d="BH")) >> selectFUN<-function(x){return(abs(x[,1]) >=1 & x[,2] < 0.05)} >> diffgenes<-selectFUN(input) >> myInterestedGenes<-names(which(diffgenes==T)) >> geneNames<-rownames(input) >> geneList<-factor(as.integer(geneNames %in% myInterestedGenes)) >> names(geneList)<-geneNames >> str(geneList) > ?Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ... > ?- attr(*, "names")= chr [1:34760] "10338001" "10338003" "10338004" > "10338017" ... > > Unfortunately, the predefined list does not contain any DE "score" > information. > I would greatly appreciate any help in attaching the score information > to a predefined list or incorporating p.value as well as fold change > cutoffs into a geneSel function when creating a topGOdata object, > > Thanks for any help, > > Scott > > Scott A. Ochsner, PhD > One Baylor Plaza BCM130, Houston, TX 77030 > Voice: (713) 798-6227 ?Fax: (713) 790-1275 > >> sessionInfo() > R version 2.9.0 (2009-04-17) > i386-pc-mingw32 > > locale: > LC_COLLATE=English_United States.1252;LC_CTYPE=English_United > States.1252;LC_MONETARY=English_United > States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252 > > attached base packages: > [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base > > > other attached packages: > [1] topGO_1.12.0 ? ? ? ?SparseM_0.80 ? ? ? ?GO.db_2.2.11 > RSQLite_0.7-1 ? ? ? DBI_0.2-4 ? ? ? ? ? AnnotationDbi_1.6.1 > Biobase_2.4.1 ? ? ? graph_1.22.2 ? ? ? ?limma_2.18.2 > > loaded via a namespace (and not attached): > [1] grid_2.9.0 ? ? ?lattice_0.17-25 tools_2.9.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 14.6 years ago Adrian Alexa ▴ 400

0

Entering edit mode

Hi Adrian, Thanks for the response. You have confirmed for me that at the moment it is not possible to create a geneSel function which utilizes more than one argument. Unfortunately, I want to utilize a fold change cutoff in addition to a p.value cutoff. The only way I can do this is to provide a predefined list with the structure below where the factor level determines the genes of interest and the universe. Unfortunately, it does not appear possible to also give a gene score (p.value) to the structure below. I guess in situations were one wishes to utilize more than one selection criteria it will not be possible to use the KS test. Don't get me wrong. I still like the added value of being able to compare the classic, elim, and weight algorithms. > str(geneList) ?Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ... ?- attr(*, "names")= chr [1:34760] "10338001" "10338003" "10338004" "10338017" ... Thanks, Scott Scott A. Ochsner, PhD One Baylor Plaza BCM130, Houston, TX 77030 Voice: (713) 798-6227 Fax: (713) 790-1275 -----Original Message----- From: Adrian Alexa [mailto:adrian.alexa@gmail.com] Sent: Thursday, August 27, 2009 11:56 AM To: Ochsner, Scott A Cc: bioconductor at stat.math.ethz.ch Subject: Re: [BioC] "topGOdata" object: How to supply gene scores with a predefined list of genes Hi Scott, I'm not sure I totally understand your question, but if you want to build a "topGOdata" object from a list a genes for which you have scores (quantifying differential expression) there is a simple way to do it. The first thing you need is a named numeric vector, where the gene identifiers are stored in the names attribute of the vector and the numeric values are the respective gene scores. The set of genes found in the names attribute defines the gene universe. For example, the following should work for you: geneList <- p.adjust(fit$p.value[,1],method="BH")) names(geneList) <- geneNames Then you will need to define a function for specifying the list of interesting genes based on the scores (in your case the adjusted p-values). The function must return a logical vector specifying which gene is selected and which not. The function must have one argument, named allScore and must not depend on any attributes of this object. If for example you want to select all genes with an adjusted p-value lower than 0.01, then the function should look like: topDiffGenes <- function(allScore) { return(allScore < 0.01) } Now you can can build a "topGOdata" object as follows (in the code bellow I assume you are using a Bioconductor annotation package, for example "hgu133a") ## build the topGOdata class GOdata <- new("topGOdata", ontology = "BP", allGenes = geneList, geneSel = topDiffGenes, annot = annFUN.db, affyLib = "hgu133a") ## display the GOdata object GOdata I hope this answers your question. Please let me know if you have further problems. Regards, Adrian On Tue, Aug 25, 2009 at 10:04 PM, Ochsner, Scott A<sochsner at="" bcm.tmc.edu=""> wrote: > Hi, > > I would like to attach gene "score" info to a predefined list of > interesting genes to generate a topGOdata object. ?The predefined list > of genes was obtained by: >> library(limma) >> library(topGO) >> > input<-cbind(FC=fit$coefficients[,1],pval=p.adjust(fit$p.value[,1],met > ho > d="BH")) >> selectFUN<-function(x){return(abs(x[,1]) >=1 & x[,2] < 0.05)} >> diffgenes<-selectFUN(input) >> myInterestedGenes<-names(which(diffgenes==T)) >> geneNames<-rownames(input) >> geneList<-factor(as.integer(geneNames %in% myInterestedGenes)) >> names(geneList)<-geneNames >> str(geneList) > ?Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ... > ?- attr(*, "names")= chr [1:34760] "10338001" "10338003" "10338004" > "10338017" ... > > Unfortunately, the predefined list does not contain any DE "score" > information. > I would greatly appreciate any help in attaching the score information > to a predefined list or incorporating p.value as well as fold change > cutoffs into a geneSel function when creating a topGOdata object, > > Thanks for any help, > > Scott > > Scott A. Ochsner, PhD > One Baylor Plaza BCM130, Houston, TX 77030 > Voice: (713) 798-6227 ?Fax: (713) 790-1275 > >> sessionInfo() > R version 2.9.0 (2009-04-17) > i386-pc-mingw32 > > locale: > LC_COLLATE=English_United States.1252;LC_CTYPE=English_United > States.1252;LC_MONETARY=English_United > States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252 > > attached base packages: > [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base > > > other attached packages: > [1] topGO_1.12.0 ? ? ? ?SparseM_0.80 ? ? ? ?GO.db_2.2.11 > RSQLite_0.7-1 ? ? ? DBI_0.2-4 ? ? ? ? ? AnnotationDbi_1.6.1 > Biobase_2.4.1 ? ? ? graph_1.22.2 ? ? ? ?limma_2.18.2 > > loaded via a namespace (and not attached): > [1] grid_2.9.0 ? ? ?lattice_0.17-25 tools_2.9.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 14.6 years ago Scott Ochsner ▴ 300

0

Entering edit mode

Hi Scott, sorry for the late reply. I hope you are still interested in the problem you raised. Your problem is very interested and I must admit I didn't thought about it when I wrote the topGO code. And as you already mentioned there is no trivial way to use more than one score in the gene selection function. However, one could do this with a bit of hacking... and I see two solutions to your problem. 1) If you want to be able to compare the results for a KS test with the ones from some count based test like Fisher's exact test, and for the count based tests you want to use a section criteria involving more than one gene score, then the most easiest way to do it is to build two "topGOdata" objects, one for the KS test, thus containing only the gene score and a dummy selection function and the other one containing a predefined list of interesting genes. Note that once a "topGOdata" object is instantiated, one can update the gene score or the predefined list of genes via the "updateGenes" method. Also very important to note here is that you need to have the same gene universe for the same objects. It will be a bit strange to compare results which are obtained using a different gene universe. Once you have the results, you can use the "GenTable" function to generate a summary of the results. You could feed any of the two "topGOdata" objects to this function (the "topGOresult" object do not directly depend on it). The disadvantage of this approach is the redundancy in the two "topGOdata" objects. However, it should be strait forward to use. 2) If you plan to use KS test and cont based tests, then one can do the following trick. The main observation is that the KS test used here is a ranked based test and thus only the ranking of the genes, based on their p-value, counts. Therefore, one can transform the vector of p-values into a vector of ranks, before building the "topGOdata" object. Now, since the ranks are unique (and we can compute them in this way), we can use them as to index a vector/table. We can build a table, similar with "input" in your first example, in which we can store multiple information on the genes. Then, if you build the "topGOdata" object using these ranks as scores, one could compute of course the KS test, and also define a selection function which will have access the data in the table using the ranks as indices. Of course, the table in this case needs to be a global variable. Once the "topGOdata" object is build it is very easy to update it with a new gene selection function using the "geneSelectionFun<-" method. Note that this method works only with ranked based test statistics and it will not generalize to tests which use the score itself easily! Bellow, you will find a chunk of code which should exemplify the second approach on the ALL data. I hope you'll find the answer in one of the solutions describe. Please let me know if you have further questions. Regards, Adrian ---------------------------------------------------------------------- -------- library(topGO) library(ALL); data(ALL) affyLib <- paste(annotation(ALL), "db", sep = ".") library(package = affyLib, character.only = TRUE) library(genefilter) f1 <- pOverA(0.25, log2(100)) f2 <- function(x) (IQR(x) > 0.5) ff <- filterfun(f1, f2) ## select genes based on the defined filters selGenes <- genefilter(ALL, ff) ## obtain the list of differentially expressed genes classLabel <- as.integer(sapply(ALL$BT, function(x) return(substr(x, 1, 1) == 'T'))) geneList <- getPvalues(exprs(ALL), classlabel = classLabel) ## select 1000 random genes randSel <- integer(length(geneList)) randSel[sample(length(geneList), 1000)] <- as.integer(1) names(randSel) <- names(geneList) ## transform the p-values into ranks scoreIndex <- rank(geneList, ties.method = "first") ## generate a data frame containing all the information you need to select interesting genes GENE.INFO <- data.frame(p.value = geneList[names(geneList)], filtered = selGenes[names(geneList)], random = randSel[names(geneList)]) ## index the data frame with the ranks GENE.INFO <- GENE.INFO[order(scoreIndex), ] ## define a function to select the "significant" genes mySelGenes <- function(allScore) { returnGENE.INFO[allScore, "p.value"] < 0.01) } ## select the genes that passed the filter criteria and have a p-value lower than 0.01 mySelGenes.2 <- function(allScore) { returnGENE.INFO[allScore, "p.value"] < 0.01 & GENE.INFO[allScore, "filtered"]) } ## how many differentially expressed genes are: sum(mySelGenes(scoreIndex)) sum(mySelGenes.2(scoreIndex)) ## build the topGOdata class GOdata <- new("topGOdata", ontology = "BP", allGenes = scoreIndex, geneSel = mySelGenes, nodeSize = 10, annot = annFUN.db, affyLib = affyLib) ## display the GOdata object GOdata ## we can change the selection criteria, by updating the object with mySelGenes.2 geneSelectionFun(GOdata) <- mySelGenes.2 ## same gene universe but different number of interesting genes ... GOdata ## we can use both KS and Fisher's exact test on this object: geneSelectionFun(GOdata) <- mySelGenes resFisher <- runTest(GOdata, algorithm = "classic", statistic = "fisher") resKS <- runTest(GOdata, algorithm = "classic", statistic = "KS") ## Fisher's exact test accounting for the gene filter geneSelectionFun(GOdata) <- mySelGenes.2 resFisher.2 <- runTest(GOdata, algorithm = "classic", statistic = "fisher") ## we can see some summary for each result resFisher resFisher.2 resKS ## we can generate the table which summarises the results ## but the stats. in the table will depend on the current selection function stored in the GOdata object. allRes <- GenTable(GOdata, classic = resFisher, KS = resKS, classic.2 = resFisher.2, orderBy = "classic.2", ranksOf = "classic", topNodes = 20) allRes ---------------------------------------------------------------------- -------- R version 2.9.0 (2009-04-17) i686-pc-linux-gnu locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US .UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_N AME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTI FICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] multtest_2.1.1 genefilter_1.24.2 hgu95av2.db_2.2.12 [4] ALL_1.4.5 topGO_1.12.0 SparseM_0.80 [7] GO.db_2.2.11 RSQLite_0.7-2 DBI_0.2-4 [10] AnnotationDbi_1.6.1 Biobase_2.4.1 graph_1.22.2 loaded via a namespace (and not attached): [1] annotate_1.22.0 grid_2.9.0 lattice_0.17-25 MASS_7.2-48 [5] splines_2.9.0 survival_2.35-4 tools_2.9.0 xtable_1.5-5

ADD REPLY • link 14.6 years ago Adrian Alexa ▴ 400

Login before adding your answer.