Problems generating a gene2GO list in topGO
@antonio-miguel-de-jesus-domingues-5182
Germany
Dear Bioconductor list,

I have a list of genes from a mouse array (custom design) that I want to analyse with topGO. The package example runs fine and I have read the vignettes (though I have probably missed something), but running my own data produces an error that seems to be related to my custom gene-to-GO map.

The results are a table with several annotations and a custom measure of significance. I created a named vector containing all the genes present on the array (Ensembl IDs) with the corresponding measure of significance - geneList.

geneList <- abs(data[ ,2])
names(geneList) <- data[ ,1]
geneList[1:5]
ENSMUSG00000025903 ENSMUSG00000025903 ENSMUSG00000025903 ENSMUSG00000025903 ENSMUSG00000033813
              0.11               0.36               0.32               0.07               0.08

is(geneList)
[1] "numeric"          "vector"           "atomic"           "EnumerationValue" "numeric or NULL"  "vectorORfactor"

summary(geneList)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0100  0.0600  0.2000  0.4568  0.5600 18.1600

# a function was then defined to select the significant genes - as in the vignette

topDiffGenes <- function(allScore) {
  return(allScore > 1)
}

x <- topDiffGenes(geneList)
sum(x)

# so far so good
# because this is a custom array the GO annotation was extracted from Ensembl using biomaRt.
# ensembl61 was used because of the gene format in my results

ensembl61 = useMart('ENSEMBL_MART_ENSEMBL', dataset = 'mmusculus_gene_ensembl',
                    host = 'feb2011.archive.ensembl.org')

test.GO.BP <- getBM(attributes = c("ensembl_gene_id", "go_biological_process_id"),
                    filters = "ensembl_gene_id", values = All.genes.Ens,
                    mart = ensembl61)
head(test.GO.BP)

     ensembl_gene_id go_biological_process_id
1 ENSMUSG00000054310               GO:0006355
2 ENSMUSG00000054728
3 ENSMUSG00000021368               GO:0032313
4 ENSMUSG00000021368               GO:0031398
5 ENSMUSG00000051335               GO:0055114
6 ENSMUSG00000051335               GO:0008152

# but when creating the topGO object a problem appears:

GOdata <- new("topGOdata",
              description = "GO analysis Test",
              ontology = "BP",
              allGenes = geneList,
              geneSel = topDiffGenes,
              annot = annFUN.gene2GO,
              nodeSize = 5,
              gene2GO = test.GO.BP)

Building most specific GOs ..... ( 0 GO terms found. )

Build GO DAG topology .......... ( 0 GO terms and 0 relations. )
Error in if (is.na(index) || index < 0 || index > length(nd)) stop(paste("selected vertex",  :
  missing value where TRUE/FALSE needed

From reading the vignette I think that the object test.GO.BP, a data.frame, needs to be converted to a list in which each gene maps to one or more GO terms, like this:

List of 6
 $ 068724: chr [1:5] "GO:0005488" "GO:0003774" "GO:0001539" "GO:0006935" ...
 $ 119608: chr [1:6] "GO:0005634" "GO:0030528" "GO:0006355" "GO:0045449" ...
 $ 049239: chr [1:13] "GO:0016787" "GO:0017057" "GO:0005975" "GO:0005783" ...
 $ 067829: chr [1:16] "GO:0045926" "GO:0016616" "GO:0000287" "GO:0030145" ...
 $ 106331: chr [1:10] "GO:0043565" "GO:0000122" "GO:0003700" "GO:0005634" ...
 $ 214717: chr [1:7] "GO:0004803" "GO:0005634" "GO:0008270" "GO:0003677" ...

Is this what I need to do next? If so, how do I do it? Or is it something else?

Any help will be appreciated.

Session info:

> sessionInfo()
R version 2.14.2 (2012-02-29)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/en_US.UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] plyr_1.7.1                           genefilter_1.36.0                    hgu95av2_2.2.0                       hgu95av2.db_2.6.3
 [5] org.Hs.eg.db_2.6.4                   affyio_1.22.0                        affydata_1.11.15                     affy_1.32.1
 [9] multtest_2.10.0                      ALL_1.4.11                           topGO_2.6.0                          SparseM_0.91
[13] GO.db_2.6.1                          graph_1.32.0                         mogene10sttranscriptcluster.db_8.0.1 org.Mm.eg.db_2.6.4
[17] RSQLite_0.11.1                       DBI_0.2-5                            AnnotationDbi_1.16.19                Biobase_2.14.0
[21] BiocInstaller_1.2.1                  biomaRt_2.10.0                       Biostrings_2.22.0                    GenomicRanges_1.6.7
[25] IRanges_1.12.6

loaded via a namespace (and not attached):
 [1] MASS_7.3-17           RColorBrewer_1.0-5    RCurl_1.91-1          XML_3.9-4             annotate_1.32.3       colorspace_1.1-1      dichromat_1.2-4
 [8] digest_0.5.2          ggplot2_0.9.0         grid_2.14.2           lattice_0.20-6        memoise_0.1           munsell_0.3           preprocessCore_1.16.0
[15] proto_0.3-9.2         reshape2_1.2.1        scales_0.2.0          splines_2.14.2        stringr_0.6           survival_2.36-12      tools_2.14.2
[22] xtable_1.7-0          zlibbioc_1.0.1

--
António Miguel de Jesus Domingues, PhD
Neugebauer group
Max Planck Institute of Molecular Cell Biology and Genetics, Dresden
Pfotenhauerstrasse 108
01307 Dresden
Germany

e-mail: domingue@mpi-cbg.de
tel. +49 351 210 2481
The Unbearable Lightness of Molecular Biology
Hi Antonio,

you are right, the main problem is with the "test.GO.BP" object. It must be a list of mappings from genes to GO terms. You can obtain such a list from your data.frame object by (code not tested):

> gene.to.GO <- split(test.GO.BP$go_biological_process_id, test.GO.BP$ensembl_gene_id)
> gene.to.GO <- lapply(gene.to.GO, unique) # to remove duplicates

This will give you a named list, where the list names are the Ensembl gene identifiers and the list entries are the GO terms annotated to the respective gene.

There is another problem with your data. The vector of gene scores "geneList" contains duplicated names, as I can see from your output (ENSMUSG00000025903 appears 4 times with different scores). This is not allowed in topGO, and you should find a way to remove the duplicates.

Hope this helps.

Regards,
Adrian Alexa

On Wed, Mar 21, 2012 at 3:43 PM, António Miguel de Jesus Domingues <amjdomingues at gmail.com> wrote:
> [...]
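
The two fixes suggested in the reply (building the gene-to-GO list and removing duplicated gene names) can be sketched roughly as below. This is only an illustration on toy data shaped like the objects in the question, not a tested recipe for the real array. Two things here are assumptions rather than part of the original advice: genes without a GO term are taken to come back from biomaRt as empty strings (as row 2 of the head() output suggests) and are dropped before splitting, and duplicated gene names are collapsed by keeping the maximum score per gene, which is an arbitrary choice that may not suit every analysis.

```r
# Toy stand-in for the biomaRt result; column names match test.GO.BP above.
test.GO.BP <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000054310", "ENSMUSG00000054728",
                      "ENSMUSG00000021368", "ENSMUSG00000021368",
                      "ENSMUSG00000051335", "ENSMUSG00000051335"),
  go_biological_process_id = c("GO:0006355", "",
                               "GO:0032313", "GO:0031398",
                               "GO:0055114", "GO:0008152"),
  stringsAsFactors = FALSE
)

# Assumption: unannotated genes appear as "" (or NA); drop those rows first
# so that no gene ends up mapped to an empty GO identifier.
has.go <- !is.na(test.GO.BP$go_biological_process_id) &
  test.GO.BP$go_biological_process_id != ""
annotated <- test.GO.BP[has.go, ]

# Named list: one character vector of unique GO IDs per Ensembl gene ID.
gene.to.GO <- split(annotated$go_biological_process_id,
                    annotated$ensembl_gene_id)
gene.to.GO <- lapply(gene.to.GO, unique)
str(gene.to.GO)

# Toy stand-in for the score vector, with one gene name repeated four times.
scores <- c(0.11, 0.36, 0.32, 0.07, 0.08)
ids <- c(rep("ENSMUSG00000025903", 4), "ENSMUSG00000033813")

# Assumption: collapse duplicates by keeping the largest score per gene.
# The result is a named numeric vector with one entry per gene.
geneList <- sapply(split(scores, ids), max)
print(geneList)
```

With objects built this way, the call from the question would pass gene2GO = gene.to.GO instead of the data.frame, and allGenes = geneList would then contain each gene name only once.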