disagreement between biomaRt and topGO

0

Entering edit mode

Dick Beyer ★ 1.4k

@dick-beyer-26

Last seen 9.6 years ago

Hello, I have a puzzling problem with a disagreement between the numbers of EntrezGene IDs I get from biomaRt and what I get from topGO for the GO category GO:0006836, neurotransmitter transport. I would assume the mismatch would be due to different GO builds, but what I find seems too extreme for that. Let me give a little background to how I get my list of Entrez Gene IDs. I start with the file ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz from which I extract IPI, EG, Symbol, description, UniGene This eventually gives me a list of 22963 unique Entrez Gene IDs. I process this list through biomaRt as follows: library(biomaRt) mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl") get.go.biomart <- getGO(id=ipi.LL.sym.descrip.ug.unique_eg[,2],type="entrezgene",mart=ma rt) >From this data.frame I can determine that there are 32 unique Entrez Gene IDs in the Gene Ontology Biological Process category GO:0006836, neurotransmitter transport. Everything is fine so far. Now I set up things for a topGO (non-affy) calculation: gene2GO <- as.list(get.go.biomart$go) names(gene2GO) <- get.go.biomart$entrezgene geneList <- list() geneList[1] <- list(factor(as.integer(ipi.LL.sym.descrip.ug.unique_eg[,2] %in% LL1027[,2]))) names(geneList[[1]]) <- ipi.LL.sym.descrip.ug.unique_eg[,2] This geneList[[1]] has, among its 22963 entries, 32 entries (that is, 32 of the "names" are Entrez Gene IDs) that are in category GO:0006836, and 4 of entries I have marked as significant by setting their values to 1. Next, I create a topGOdata Biological Process object: BP.GOdata <- list() BP.GOdata[1i] <- list(new("topGOdata", ontology = "BP", allGenes = geneList[[1]], annot = annFUN.gene2GO, gene2GO = gene2GO)) and submit this topGOdata object to the following: test.stat.BP <- list() result.BP <- list() test.stat.BP[1] <- list(new("classicCount", testStatistic = GOFisherTest, name = "Fisher test")) result.BP[1] <- list(getSigGroups(BP.GOdata, test.stat.BP[[1]])) test.stat.BP[2] <- list(new("elimCount", testStatistic = GOFisherTest, name = "Fisher test", cutOff = 0.01)) result.BP[2] <- list(getSigGroups(BP.GOdata, test.stat.BP[[2]])) test.stat.BP[3] <- list(new("weightCount", testStatistic = GOFisherTest, name = "Fisher test", sigRatio = "ratio")) result.BP[3] <- list(getSigGroups(BP.GOdata, test.stat.BP[[3]])) l <- list(classic = score(result.BP[[1]]), elim = score(result.BP[[2]]), weight = score(result.BP[[3]])) allRes.BP <- genTable(BP.GOdata, l, orderBy = "classic", ranksOf = "classic", top = length(score(result.BP[[1]]))) My resulting data.frame, allRes.BP, shows that for category GO:0006836, there are only 11 annotated Entrez Gene IDs, not 32. The call to genTable() does correctly find that 4 of these are significant. This latest released version of topGO must be using the Mar 14 11:46:06 2007 build of GO, and I assume biomaRt is more current as I did the query just today. If category GO:0006836, neurotransmitter transport, has grown from 11 EGs to 32 EGs between now and last March for Human, then everything is as it should be. But this seems unlikely to me. So I wonder if there is a bug in the topGO calculations. Here is my session: sessionInfo() R version 2.5.1 (2007-06-27) i386-pc-mingw32 locale: LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252 attached base packages: [1] "tools" "stats" "graphics" "grDevices" "utils" "datasets" "methods" "base" other attached packages: goTools hgu133plus2 Rgraphviz geneplotter lattice topGO SparseM annotate GO graph Biobase biomaRt RCurl XML "1.8.0" "1.16.0" "1.14.1" "1.14.0" "0.15-11" "1.2.1" "0.73" "1.14.1" "1.16.0" "1.14.2" "1.14.1" "1.10.1" "0.8-0" "1.9-0" Thanks very much, Dick ********************************************************************** ********* Richard P. Beyer, Ph.D. University of Washington Tel.:(206) 616 7378 Env. & Occ. Health Sci. , Box 354695 Fax: (206) 685 4696 4225 Roosevelt Way NE, # 100 Seattle, WA 98105-6099 http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html http://staff.washington.edu/~dbeyer

GO hgu133plus2 annotate geneplotter graph Rgraphviz PROcess goTools biomaRt Category GO • 2.1k views

ADD COMMENT • link updated 16.6 years ago by Steffen ▴ 500 • written 16.7 years ago by Dick Beyer ★ 1.4k

0

Entering edit mode

Steffen ▴ 500

@steffen-2351

Last seen 9.6 years ago

Hi Dick, I can't help you with the topGO part of your question but a faster way to get the EntrezGene identifiers associated with a certain GO id is as follows: ids = getBM(c("entrezgene","go"), filters="go", values="GO:0006836", mart = mart) It looks like there are indeed 32 EntrezGene identifiers associated with this term in the current release of Ensembl. Cheers, Steffen Dick Beyer wrote: > Hello, > > I have a puzzling problem with a disagreement between the numbers of EntrezGene IDs I get from biomaRt and what I get from topGO for the GO category GO:0006836, neurotransmitter transport. I would assume the mismatch would be due to different GO builds, but what I find seems too extreme for that. > > Let me give a little background to how I get my list of Entrez Gene IDs. > > I start with the file ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz > > from which I extract IPI, EG, Symbol, description, UniGene > > This eventually gives me a list of 22963 unique Entrez Gene IDs. I process this list through biomaRt as follows: > > library(biomaRt) > mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl") > get.go.biomart <- > getGO(id=ipi.LL.sym.descrip.ug.unique_eg[,2],type="entrezgene",mart= mart) > > >From this data.frame I can determine that there are 32 unique Entrez Gene IDs in the Gene Ontology Biological Process category GO:0006836, neurotransmitter transport. Everything is fine so far. > > Now I set up things for a topGO (non-affy) calculation: > > gene2GO <- as.list(get.go.biomart$go) > names(gene2GO) <- get.go.biomart$entrezgene > > geneList <- list() > geneList[1] <- list(factor(as.integer(ipi.LL.sym.descrip.ug.unique_eg[,2] %in% LL1027[,2]))) > names(geneList[[1]]) <- ipi.LL.sym.descrip.ug.unique_eg[,2] > > This geneList[[1]] has, among its 22963 entries, 32 entries (that is, 32 of the "names" are Entrez Gene IDs) that are in category GO:0006836, and 4 of entries I have marked as significant by setting their values to 1. > > Next, I create a topGOdata Biological Process object: > > BP.GOdata <- list() > BP.GOdata[1i] <- list(new("topGOdata", ontology = "BP", allGenes = geneList[[1]], > annot = annFUN.gene2GO, gene2GO = gene2GO)) > > and submit this topGOdata object to the following: > > test.stat.BP <- list() > result.BP <- list() > test.stat.BP[1] <- list(new("classicCount", testStatistic = GOFisherTest, name = "Fisher test")) > result.BP[1] <- list(getSigGroups(BP.GOdata, test.stat.BP[[1]])) > test.stat.BP[2] <- list(new("elimCount", testStatistic = GOFisherTest, name = "Fisher test", cutOff = 0.01)) > result.BP[2] <- list(getSigGroups(BP.GOdata, test.stat.BP[[2]])) > test.stat.BP[3] <- list(new("weightCount", testStatistic = GOFisherTest, name = "Fisher test", sigRatio = "ratio")) > result.BP[3] <- list(getSigGroups(BP.GOdata, test.stat.BP[[3]])) > l <- list(classic = score(result.BP[[1]]), elim = score(result.BP[[2]]), weight = score(result.BP[[3]])) > allRes.BP <- genTable(BP.GOdata, l, orderBy = "classic", ranksOf = "classic", top = length(score(result.BP[[1]]))) > > My resulting data.frame, allRes.BP, shows that for category GO:0006836, there are only 11 annotated Entrez Gene IDs, not 32. The call to genTable() does correctly find that 4 of these are significant. > > This latest released version of topGO must be using the Mar 14 11:46:06 2007 build of GO, and I assume biomaRt is more current as I did the query just today. If category GO:0006836, neurotransmitter transport, has grown from 11 EGs to 32 EGs between now and last March for Human, then everything is as it should be. But this seems unlikely to me. So I wonder if there is a bug in the topGO calculations. > > Here is my session: > > > sessionInfo() > R version 2.5.1 (2007-06-27) > i386-pc-mingw32 > > locale: > LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252 > > attached base packages: > [1] "tools" "stats" "graphics" "grDevices" "utils" "datasets" "methods" "base" > > other attached packages: > goTools hgu133plus2 Rgraphviz geneplotter lattice topGO SparseM annotate GO graph Biobase biomaRt RCurl XML > "1.8.0" "1.16.0" "1.14.1" "1.14.0" "0.15-11" "1.2.1" "0.73" "1.14.1" "1.16.0" "1.14.2" "1.14.1" "1.10.1" "0.8-0" "1.9-0" > > Thanks very much, > Dick > ******************************************************************** *********** > Richard P. Beyer, Ph.D. University of Washington > Tel.:(206) 616 7378 Env. & Occ. Health Sci. , Box 354695 > Fax: (206) 685 4696 4225 Roosevelt Way NE, # 100 > Seattle, WA 98105-6099 > http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html > http://staff.washington.edu/~dbeyer > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD COMMENT • link 16.7 years ago Steffen ▴ 500

0

Entering edit mode

Hi Steffen, Thanks for the good tip. I was getting my huge table as I needed EGs with GOs for all GO terms, but I haven't used getBM() before. Cheers, Dick ********************************************************************** ********* Richard P. Beyer, Ph.D. University of Washington Tel.:(206) 616 7378 Env. & Occ. Health Sci. , Box 354695 Fax: (206) 685 4696 4225 Roosevelt Way NE, # 100 Seattle, WA 98105-6099 http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html http://staff.washington.edu/~dbeyer ********************************************************************** ********* On Wed, 5 Sep 2007, Steffen wrote: > Hi Dick, > > I can't help you with the topGO part of your question but a faster way to get > the EntrezGene identifiers associated with a certain GO id is as follows: > > ids = getBM(c("entrezgene","go"), filters="go", values="GO:0006836", mart = > mart) > > It looks like there are indeed 32 EntrezGene identifiers associated with this > term in the current release of Ensembl. > > Cheers, > Steffen > > Dick Beyer wrote: >> Hello, >> >> I have a puzzling problem with a disagreement between the numbers of >> EntrezGene IDs I get from biomaRt and what I get from topGO for the GO >> category GO:0006836, neurotransmitter transport. I would assume the mismatch >> would be due to different GO builds, but what I find seems too extreme for >> that. >> >> Let me give a little background to how I get my list of Entrez Gene IDs. >> >> I start with the file >> ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz >> >> from which I extract IPI, EG, Symbol, description, UniGene >> >> This eventually gives me a list of 22963 unique Entrez Gene IDs. I process >> this list through biomaRt as follows: >> >> library(biomaRt) >> mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl") >> get.go.biomart <- >> getGO(id=ipi.LL.sym.descrip.ug.unique_eg[,2],type="entrezgene",mart =mart) >> >> >From this data.frame I can determine that there are 32 unique Entrez Gene >> IDs in the Gene Ontology Biological Process category GO:0006836, >> neurotransmitter transport. Everything is fine so far. >> >> Now I set up things for a topGO (non-affy) calculation: >> >> gene2GO <- as.list(get.go.biomart$go) >> names(gene2GO) <- get.go.biomart$entrezgene >> >> geneList <- list() >> geneList[1] <- list(factor(as.integer(ipi.LL.sym.descrip.ug.unique_eg[,2] >> %in% LL1027[,2]))) >> names(geneList[[1]]) <- ipi.LL.sym.descrip.ug.unique_eg[,2] >> >> This geneList[[1]] has, among its 22963 entries, 32 entries (that is, 32 of >> the "names" are Entrez Gene IDs) that are in category GO:0006836, and 4 of >> entries I have marked as significant by setting their values to 1. >> >> Next, I create a topGOdata Biological Process object: >> >> BP.GOdata <- list() BP.GOdata[1i] <- list(new("topGOdata", ontology = "BP", >> allGenes = geneList[[1]], >> annot = annFUN.gene2GO, gene2GO = gene2GO)) >> >> and submit this topGOdata object to the following: >> >> test.stat.BP <- list() >> result.BP <- list() >> test.stat.BP[1] <- list(new("classicCount", testStatistic = >> GOFisherTest, name = "Fisher test")) >> result.BP[1] <- list(getSigGroups(BP.GOdata, test.stat.BP[[1]])) >> test.stat.BP[2] <- list(new("elimCount", testStatistic = GOFisherTest, >> name = "Fisher test", cutOff = 0.01)) >> result.BP[2] <- list(getSigGroups(BP.GOdata, test.stat.BP[[2]])) >> test.stat.BP[3] <- list(new("weightCount", testStatistic = >> GOFisherTest, name = "Fisher test", sigRatio = "ratio")) >> result.BP[3] <- list(getSigGroups(BP.GOdata, test.stat.BP[[3]])) >> l <- list(classic = score(result.BP[[1]]), elim = >> score(result.BP[[2]]), weight = score(result.BP[[3]])) >> allRes.BP <- genTable(BP.GOdata, l, orderBy = "classic", ranksOf = >> "classic", top = length(score(result.BP[[1]]))) >> >> My resulting data.frame, allRes.BP, shows that for category GO:0006836, >> there are only 11 annotated Entrez Gene IDs, not 32. The call to genTable() >> does correctly find that 4 of these are significant. >> >> This latest released version of topGO must be using the Mar 14 11:46:06 2007 >> build of GO, and I assume biomaRt is more current as I did the query just >> today. If category GO:0006836, neurotransmitter transport, has grown from >> 11 EGs to 32 EGs between now and last March for Human, then everything is as >> it should be. But this seems unlikely to me. So I wonder if there is a bug >> in the topGO calculations. >> >> Here is my session: >> >> >> sessionInfo() >> R version 2.5.1 (2007-06-27) i386-pc-mingw32 >> >> locale: >> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United >> States.1252;LC_MONETARY=English_United >> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252 >> >> attached base packages: >> [1] "tools" "stats" "graphics" "grDevices" "utils" "datasets" >> "methods" "base" >> >> other attached packages: >> goTools hgu133plus2 Rgraphviz geneplotter lattice topGO >> SparseM annotate GO graph Biobase biomaRt >> RCurl XML >> "1.8.0" "1.16.0" "1.14.1" "1.14.0" "0.15-11" "1.2.1" >> "0.73" "1.14.1" "1.16.0" "1.14.2" "1.14.1" "1.10.1" >> "0.8-0" "1.9-0" >> >> Thanks very much, >> Dick >> ******************************************************************* ************ >> Richard P. Beyer, Ph.D. University of Washington >> Tel.:(206) 616 7378 Env. & Occ. Health Sci. , Box 354695 >> Fax: (206) 685 4696 4225 Roosevelt Way NE, # 100 >> Seattle, WA 98105-6099 >> http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html >> http://staff.washington.edu/~dbeyer >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >>

ADD REPLY • link 16.7 years ago Dick Beyer ★ 1.4k

0

Entering edit mode

Steffen ▴ 500

@steffen-2351

Last seen 9.6 years ago

Hi Dick, I looked a bit more in detail into the disagreement and it seems like the biomaRt and GO packages map largely the same entrezgene ids to your GO term (GO:0006836). Here's what I did: I used the GO package to find the entrezgene ids that map to GO:0006836, I get about 138 unique entrezgene ids using the commands: library(GO) entrez = unique(get("GO:0006836",GOENTREZID)). It looks like these entrezgene ids are a mix of entrezgene ids from all species. Of these 138 entrezgene ids, 31 match to the output of our getBM query in human, which was: mart=useMart("ensembl",dataset="hsapiens_gene_ensembl") ids = getBM(c("entrezgene","go"), filters=c("go","with_entrezgene"), values=list("GO:0006836",TRUE), mart = mart) The getBM query gave 32 unique entrezgene identifiers. The only entrezgene id that we found with getBM and that was not found with the GO package is: 594855 In Ensembl this identifier maps to the uniprot accession Q8WVH0 and the quickgo search at http://www.ebi.ac.uk/ego/ associated the GO:0006836 id with Q8WVH0. Thus the GO mapping in Ensembl is correct. Probably this discrepancy is caused by a version difference between the mappings in the GO package and Ensembl. I'm not sure how topGO uses the GO package, is it possible there is a problem in topGO that causes only 11 of these 30+ entrezgene ids to appear in your analysis? Or did your analysis in topGO filtered out some of the entrezgene ids? I haven't used topGO before so I'm not sure how to interpret the example analysis you showed. > sessionInfo() R version 2.5.1 (2007-06-27) i386-apple-darwin8.9.1 locale: en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] "tools" "stats" "graphics" "grDevices" "utils" "datasets" "methods" "base" other attached packages: biomaRt RCurl XML GO geneplotter lattice annotate Biobase RMySQL "1.11.6" "0.8-0" "1.9-0" "1.16.0" "1.14.0" "0.15-11" "1.14.1" "1.14.1" "0.6-0" DBI "0.2-3" Cheers, Steffen Dick Beyer wrote: > Hello, > > I have a puzzling problem with a disagreement between the numbers of EntrezGene IDs I get from biomaRt and what I get from topGO for the GO category GO:0006836, neurotransmitter transport. I would assume the mismatch would be due to different GO builds, but what I find seems too extreme for that. > > Let me give a little background to how I get my list of Entrez Gene IDs. > > I start with the file ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz > > from which I extract IPI, EG, Symbol, description, UniGene > > This eventually gives me a list of 22963 unique Entrez Gene IDs. I process this list through biomaRt as follows: > > library(biomaRt) > mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl") > get.go.biomart <- > getGO(id=ipi.LL.sym.descrip.ug.unique_eg[,2],type="entrezgene",mart= mart) > > >From this data.frame I can determine that there are 32 unique Entrez Gene IDs in the Gene Ontology Biological Process category GO:0006836, neurotransmitter transport. Everything is fine so far. > > Now I set up things for a topGO (non-affy) calculation: > > gene2GO <- as.list(get.go.biomart$go) > names(gene2GO) <- get.go.biomart$entrezgene > > geneList <- list() > geneList[1] <- list(factor(as.integer(ipi.LL.sym.descrip.ug.unique_eg[,2] %in% LL1027[,2]))) > names(geneList[[1]]) <- ipi.LL.sym.descrip.ug.unique_eg[,2] > > This geneList[[1]] has, among its 22963 entries, 32 entries (that is, 32 of the "names" are Entrez Gene IDs) that are in category GO:0006836, and 4 of entries I have marked as significant by setting their values to 1. > > Next, I create a topGOdata Biological Process object: > > BP.GOdata <- list() > BP.GOdata[1i] <- list(new("topGOdata", ontology = "BP", allGenes = geneList[[1]], > annot = annFUN.gene2GO, gene2GO = gene2GO)) > > and submit this topGOdata object to the following: > > test.stat.BP <- list() > result.BP <- list() > test.stat.BP[1] <- list(new("classicCount", testStatistic = GOFisherTest, name = "Fisher test")) > result.BP[1] <- list(getSigGroups(BP.GOdata, test.stat.BP[[1]])) > test.stat.BP[2] <- list(new("elimCount", testStatistic = GOFisherTest, name = "Fisher test", cutOff = 0.01)) > result.BP[2] <- list(getSigGroups(BP.GOdata, test.stat.BP[[2]])) > test.stat.BP[3] <- list(new("weightCount", testStatistic = GOFisherTest, name = "Fisher test", sigRatio = "ratio")) > result.BP[3] <- list(getSigGroups(BP.GOdata, test.stat.BP[[3]])) > l <- list(classic = score(result.BP[[1]]), elim = score(result.BP[[2]]), weight = score(result.BP[[3]])) > allRes.BP <- genTable(BP.GOdata, l, orderBy = "classic", ranksOf = "classic", top = length(score(result.BP[[1]]))) > > My resulting data.frame, allRes.BP, shows that for category GO:0006836, there are only 11 annotated Entrez Gene IDs, not 32. The call to genTable() does correctly find that 4 of these are significant. > > This latest released version of topGO must be using the Mar 14 11:46:06 2007 build of GO, and I assume biomaRt is more current as I did the query just today. If category GO:0006836, neurotransmitter transport, has grown from 11 EGs to 32 EGs between now and last March for Human, then everything is as it should be. But this seems unlikely to me. So I wonder if there is a bug in the topGO calculations. > > Here is my session: > > > sessionInfo() > R version 2.5.1 (2007-06-27) > i386-pc-mingw32 > > locale: > LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252 > > attached base packages: > [1] "tools" "stats" "graphics" "grDevices" "utils" "datasets" "methods" "base" > > other attached packages: > goTools hgu133plus2 Rgraphviz geneplotter lattice topGO SparseM annotate GO graph Biobase biomaRt RCurl XML > "1.8.0" "1.16.0" "1.14.1" "1.14.0" "0.15-11" "1.2.1" "0.73" "1.14.1" "1.16.0" "1.14.2" "1.14.1" "1.10.1" "0.8-0" "1.9-0" > > Thanks very much, > Dick > ******************************************************************** *********** > Richard P. Beyer, Ph.D. University of Washington > Tel.:(206) 616 7378 Env. & Occ. Health Sci. , Box 354695 > Fax: (206) 685 4696 4225 Roosevelt Way NE, # 100 > Seattle, WA 98105-6099 > http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html > http://staff.washington.edu/~dbeyer > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD COMMENT • link 16.6 years ago Steffen ▴ 500

0

Entering edit mode

Hi Steffen, Thanks very much for your help. I have just finished the same sort of confirmation you did. I also found the source of my problem: me. If you use topGO for a non-affy analysis, you have to create a gene2GO object: "gene2GO named list of character vectors. The list names are genes identifiers. For each gene the character vector contains the GO terms IDs it maps to." What I created was one entry for each Entrez Gene/GOID pair, rather than one list of GOIDs for each Entrez Gene ID. Unfortunatley, this was really easy to do with the output I got from the getGO() function. So, I'm sorry my confusion caused you extra work. Your help did help me debug my problem, and for that I'm very grateful. Thanks very much, Dick ********************************************************************** ********* Richard P. Beyer, Ph.D. University of Washington Tel.:(206) 616 7378 Env. & Occ. Health Sci. , Box 354695 Fax: (206) 685 4696 4225 Roosevelt Way NE, # 100 Seattle, WA 98105-6099 http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html http://staff.washington.edu/~dbeyer ********************************************************************** ********* On Fri, 7 Sep 2007, Steffen wrote: > Hi Dick, > > I looked a bit more in detail into the disagreement and it seems like the > biomaRt and GO packages map largely the same entrezgene ids to your GO term > (GO:0006836). > Here's what I did: > > I used the GO package to find the entrezgene ids that map to GO:0006836, I get > about 138 unique entrezgene ids using the commands: > > library(GO) > entrez = unique(get("GO:0006836",GOENTREZID)). > > It looks like these entrezgene ids are a mix of entrezgene ids from all > species. > Of these 138 entrezgene ids, 31 match to the output of our getBM query in > human, which was: > > mart=useMart("ensembl",dataset="hsapiens_gene_ensembl") > ids = getBM(c("entrezgene","go"), filters=c("go","with_entrezgene"), > values=list("GO:0006836",TRUE), mart = mart) > > The getBM query gave 32 unique entrezgene identifiers. > The only entrezgene id that we found with getBM and that was not found with the > GO package is: 594855 > In Ensembl this identifier maps to the uniprot accession Q8WVH0 and the > quickgo search at http://www.ebi.ac.uk/ego/ > associated the GO:0006836 id with Q8WVH0. Thus the GO mapping in Ensembl is > correct. Probably this discrepancy is caused by a version difference between > the mappings in the GO package and Ensembl. > > I'm not sure how topGO uses the GO package, is it possible there is a problem > in topGO that causes only 11 of these 30+ entrezgene ids to appear in your > analysis? Or did your analysis in topGO filtered out some of the entrezgene > ids? I haven't used topGO before so I'm not sure how to interpret the example > analysis you showed. > >> sessionInfo() > R version 2.5.1 (2007-06-27) > i386-apple-darwin8.9.1 > > locale: > en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] "tools" "stats" "graphics" "grDevices" "utils" "datasets" > "methods" "base" other attached packages: > biomaRt RCurl XML GO geneplotter lattice > annotate Biobase RMySQL > "1.11.6" "0.8-0" "1.9-0" "1.16.0" "1.14.0" "0.15-11" > "1.14.1" "1.14.1" "0.6-0" > DBI > "0.2-3" > > > Cheers, > Steffen > > > > > Dick Beyer wrote: >> Hello, >> >> I have a puzzling problem with a disagreement between the numbers of >> EntrezGene IDs I get from biomaRt and what I get from topGO for the GO >> category GO:0006836, neurotransmitter transport. I would assume the mismatch >> would be due to different GO builds, but what I find seems too extreme for >> that. >> >> Let me give a little background to how I get my list of Entrez Gene IDs. >> >> I start with the file >> ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz >> >> from which I extract IPI, EG, Symbol, description, UniGene >> >> This eventually gives me a list of 22963 unique Entrez Gene IDs. I process >> this list through biomaRt as follows: >> >> library(biomaRt) >> mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl") >> get.go.biomart <- >> getGO(id=ipi.LL.sym.descrip.ug.unique_eg[,2],type="entrezgene",mart =mart) >> >> >From this data.frame I can determine that there are 32 unique Entrez Gene >> IDs in the Gene Ontology Biological Process category GO:0006836, >> neurotransmitter transport. Everything is fine so far. >> >> Now I set up things for a topGO (non-affy) calculation: >> >> gene2GO <- as.list(get.go.biomart$go) >> names(gene2GO) <- get.go.biomart$entrezgene >> >> geneList <- list() >> geneList[1] <- list(factor(as.integer(ipi.LL.sym.descrip.ug.unique_eg[,2] >> %in% LL1027[,2]))) >> names(geneList[[1]]) <- ipi.LL.sym.descrip.ug.unique_eg[,2] >> >> This geneList[[1]] has, among its 22963 entries, 32 entries (that is, 32 of >> the "names" are Entrez Gene IDs) that are in category GO:0006836, and 4 of >> entries I have marked as significant by setting their values to 1. >> >> Next, I create a topGOdata Biological Process object: >> >> BP.GOdata <- list() BP.GOdata[1i] <- list(new("topGOdata", ontology = "BP", >> allGenes = geneList[[1]], >> annot = annFUN.gene2GO, gene2GO = gene2GO)) >> >> and submit this topGOdata object to the following: >> >> test.stat.BP <- list() >> result.BP <- list() >> test.stat.BP[1] <- list(new("classicCount", testStatistic = >> GOFisherTest, name = "Fisher test")) >> result.BP[1] <- list(getSigGroups(BP.GOdata, test.stat.BP[[1]])) >> test.stat.BP[2] <- list(new("elimCount", testStatistic = GOFisherTest, >> name = "Fisher test", cutOff = 0.01)) >> result.BP[2] <- list(getSigGroups(BP.GOdata, test.stat.BP[[2]])) >> test.stat.BP[3] <- list(new("weightCount", testStatistic = >> GOFisherTest, name = "Fisher test", sigRatio = "ratio")) >> result.BP[3] <- list(getSigGroups(BP.GOdata, test.stat.BP[[3]])) >> l <- list(classic = score(result.BP[[1]]), elim = >> score(result.BP[[2]]), weight = score(result.BP[[3]])) >> allRes.BP <- genTable(BP.GOdata, l, orderBy = "classic", ranksOf = >> "classic", top = length(score(result.BP[[1]]))) >> >> My resulting data.frame, allRes.BP, shows that for category GO:0006836, >> there are only 11 annotated Entrez Gene IDs, not 32. The call to genTable() >> does correctly find that 4 of these are significant. >> >> This latest released version of topGO must be using the Mar 14 11:46:06 2007 >> build of GO, and I assume biomaRt is more current as I did the query just >> today. If category GO:0006836, neurotransmitter transport, has grown from >> 11 EGs to 32 EGs between now and last March for Human, then everything is as >> it should be. But this seems unlikely to me. So I wonder if there is a bug >> in the topGO calculations. >> >> Here is my session: >> >> >> sessionInfo() >> R version 2.5.1 (2007-06-27) i386-pc-mingw32 >> >> locale: >> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United >> States.1252;LC_MONETARY=English_United >> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252 >> >> attached base packages: >> [1] "tools" "stats" "graphics" "grDevices" "utils" "datasets" >> "methods" "base" >> >> other attached packages: >> goTools hgu133plus2 Rgraphviz geneplotter lattice topGO >> SparseM annotate GO graph Biobase biomaRt >> RCurl XML >> "1.8.0" "1.16.0" "1.14.1" "1.14.0" "0.15-11" "1.2.1" >> "0.73" "1.14.1" "1.16.0" "1.14.2" "1.14.1" "1.10.1" >> "0.8-0" "1.9-0" >> >> Thanks very much, >> Dick >> ******************************************************************* ************ >> Richard P. Beyer, Ph.D. University of Washington >> Tel.:(206) 616 7378 Env. & Occ. Health Sci. , Box 354695 >> Fax: (206) 685 4696 4225 Roosevelt Way NE, # 100 >> Seattle, WA 98105-6099 >> http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html >> http://staff.washington.edu/~dbeyer >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >>

ADD REPLY • link 16.6 years ago Dick Beyer ★ 1.4k

Login before adding your answer.