GO terms for E. coli micro arrays (ecoliK12.db generation)


Gaspard Lequeux ▴ 20


Hej,

Has anybody succeeded in constructing an ecoliK12.db database with usable Gene Ontology annotations? topGO is a nice R package that works very well with the yeast genome, and I would like to use it with E. coli, but almost no GO terms appear to be available for E. coli when using the tools provided by AnnotationDbi.

No ecoliK12.db database exists in the repositories, but according to the documentation in the AnnotationDbi package, building one should be very easy with the makeECOLICHIP_DB command from that package.

However, the code didn't run. First, some modifications had to be made to the AnnotationDbi package. I downloaded the source code (AnnotationDbi_1.6.0.tar.gz). I also made sure that ecoliK12.db0 was installed (version ecoliK12.db0_2.2.11.tar.gz was used).

In the directory AnnotationDbi/R, the following two files were modified.

sqlForge_baseMapBuilder.R:

Comment out line 234:

    sql <- "INSERT INTO probe2gene SELECT DISTINCT m.probe_id, u.gene_id \
    FROM min_other_rank as m INNER JOIN src.unigene as u WHERE \
    m.gene_id=u.unigene_id;"

and line 235:

    sqliteQuickSQL(db, sql)

Otherwise one gets the error:

    RS-DBI driver: (error in statement: no such table: src.unigene)

sqlForge_tableBuilder.R:

Comment out line 181:

    sqliteQuickSQL(db, "ANALYZE;")

and lines 3179 and 3180:

    sqliteQuickSQL(db, "VACUUM probe_map;")
    sqliteQuickSQL(db, "ANALYZE;")

Otherwise one gets the error:

    RS-DBI driver: (RS_SQLite_exec: could not execute1: attempt to write a
    readonly database)

The package was retarred and installed with:

    R CMD INSTALL AnnotationDbi_1.6.0.tar.gz

I also downloaded the annotation file for the ecoli2 array from Affymetrix (E_coli_2.na28.annot.csv).
R was started and the following commands were given (make sure the directory 'ecoliK12.db' exists; the path to the site-library may also vary):

    library(AnnotationDbi); library(ecoliK12.db0)

    makeECOLICHIP_DB(affy=TRUE, prefix='ecoliK12',
                     fileName="E_coli_2.na28.annot.csv",
                     baseMapType='eg',
                     chipSrc='/usr/local/lib/R/site-library/ecoliK12.db0/extdata/chipsrc_ecoliK12.sqlite',
                     chipMapSrc='/usr/local/lib/R/site-library/ecoliK12.db0/extdata/chipmapsrc_ecoliK12.sqlite',
                     chipName='E_coli_2', outputDir='ecoliK12.db', version='2.2.11')

In the ecoliK12.db directory, another ecoliK12.db directory was created by those R commands. This directory was tarred (tar -czf ecoliK12.db_2.2.11.tar.gz ecoliK12.db/), resulting in an installable package that technically works with topGO.

But not many GO terms are associated with the probes; far fewer than the number of GO terms that can be found for each probe in the probe annotation file provided by Affymetrix.

The table below lists the number of GO terms found in the different tables for the three ontologies:

MG1655: the number of GO terms annotated to the probes of MG1655, as found in the Affymetrix probe annotation file (that array also contains probes for other E. coli strains; they are filtered out for this table).

ecoliK12.db: the number of GO terms found in the database generated by the makeECOLICHIP_DB command above.

ecoliK12.db0: the number of GO terms found in the original database that makeECOLICHIP_DB uses for generating the ecoliK12.db database. Note that no evidence codes occur in that database either (the evidence column has the value '-' everywhere).

                  GO_BP_all  GO_CC_all  GO_MF_all
    MG1655             9899       7023      17925
    ecoliK12.db        6394       2999        211
    ecoliK12.db0      33526      17367       1266

(The comparison was done with the _all tables from the database, to be able to compare with the Affymetrix file.)

Why are there not more GO terms in ecoliK12.db? Using a baseMapType other than 'eg' does not help. Only 'refseq' doesn't crash, but it yields even fewer GO terms than 'eg'. Furthermore, for refseq, I think some modification has to be made to the cleanRefSeqs function in sqlForge_baseMapBuilder.R (the line with baseMap[,2] = sub("\\.\\d+?$", "", baseMap[,2], perl=TRUE) should be changed to baseMap[,2] = sub("^[^_]*_([^_]*)_.*", "\\1", baseMap[,2], perl=TRUE)).

Adding the GO terms from the Affymetrix file to the database afterwards doesn't work either: the results in topGO are no better, with still only a few (fewer than 10) significant nodes when comparing aerobically with anaerobically grown cells, a comparison that gives more than 2000 differentially expressed genes.

A possible problem might be that Affymetrix also provides the redundant GO terms for the probes and that I added all of those to the GO_XX and GO_XX_all tables. The GO_XX tables should normally contain only the most specific GO terms.

Is this a known problem? Should I give up doing GO analysis with topGO for E. coli? Or is there a workaround?

The R version used is R version 2.7.1 (2008-06-23) on Debian (however, for AnnotationDbi and ecoliK12.db0 the most recent versions were downloaded from the bioC website, together with their dependencies).

Thank you very much for any suggestions,

Gaspard
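To make the difference between the two sub() patterns quoted in the post easier to see, here is a minimal base R sketch that applies both to a few strings. The identifiers are invented for illustration and are not taken from the Affymetrix file, and the function names strip_version and take_second_field are hypothetical; only the two regular expressions come from the post.

```r
# Two candidate clean-up rules for the second column of a base map.
# The identifiers used below are invented for illustration only.
strip_version <- function(ids) {
  # Rule currently in cleanRefSeqs, as quoted in the post:
  # drop a trailing ".<digits>" version suffix.
  sub("\\.\\d+?$", "", ids, perl = TRUE)
}

take_second_field <- function(ids) {
  # Rule proposed in the post: keep the text between the first and the
  # second underscore. Strings with fewer than two underscores do not
  # match the pattern and are returned unchanged.
  sub("^[^_]*_([^_]*)_.*", "\\1", ids, perl = TRUE)
}

ids <- c("NP_414542.1", "aa_bb_cc", "plain")
comparison <- data.frame(
  id = ids,
  strip_version = strip_version(ids),
  take_second_field = take_second_field(ids),
  stringsAsFactors = FALSE
)
print(comparison)
```

Running both rules on a handful of real identifiers from the annotation file would show which one, if either, yields values that match the accession column in the source database.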



Marc Carlson ★ 7.2k


Hi Gaspard,

Thank you for your input.

The number of mappable GO terms in a chip package will almost always be a subset of the number of GO terms in the source database. This is just a consequence of the fact that most chips don't measure all of the genes.

I would love to improve the number of GO mappings, but the source I have for GO to Entrez Gene mappings (NCBI) has only provided the mappings you see here. This is also why there is no evidence code: I only have what NCBI gave me. I could consider using blast2GO mappings instead, but then the nature of the data changes, and so far we have only used blast2GO for those organisms where it was the only option. If you have another good primary source for such mappings, I would love to know about it.

As you have discovered, not many people are using any of the E. coli material for chip packages yet (in fact, you might be the first). Most people only use the organism packages org.EcK12.eg.db <http://www.bioconductor.org/packages/release/data/annotation/html/org.EcK12.eg.db.html> or org.EcSakai.eg.db <http://www.bioconductor.org/packages/release/data/annotation/html/org.EcSakai.eg.db.html>. But since these are based on the same sources, the data representation will be similar.

Finally, I would encourage you not to comment out the VACUUM and ANALYZE statements in AnnotationDbi when you make chip packages. Instead, I would recommend that you use more liberal write permissions so that your code can perform writes on your newly created DBs.

Marc

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
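Before changing permissions, it may help to confirm which paths are actually not writable. The following is a minimal base R sketch, not part of AnnotationDbi: the helper check_writable is hypothetical, and the temporary directory only stands in for the real outputDir. Which file triggers the "readonly database" error is not established in this thread, so the same check could also be pointed at the ecoliK12.db0 extdata directory.

```r
# Hypothetical helper: report whether a directory and any SQLite files
# below it can be written by the current user.
check_writable <- function(dir) {
  if (!dir.exists(dir)) {
    stop("directory does not exist: ", dir)
  }
  sqlite_files <- list.files(dir, pattern = "\\.sqlite$",
                             recursive = TRUE, full.names = TRUE)
  paths <- c(dir, sqlite_files)
  # file.access() returns 0 on success and -1 on failure;
  # mode = 2 tests for write permission.
  data.frame(
    path = paths,
    writable = unname(file.access(paths, mode = 2) == 0),
    stringsAsFactors = FALSE
  )
}

# Stand-in for the real output directory (assumption for this example).
out_dir <- file.path(tempdir(), "ecoliK12.db")
dir.create(out_dir, showWarnings = FALSE)
result <- check_writable(out_dir)
print(result)
```

Any row with writable equal to FALSE points at a path whose permissions would need to be relaxed before VACUUM or ANALYZE can succeed on it.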


Hej Marc,

On Fri, 12 Jun 2009, Marc Carlson wrote:

> Thank you for your input.

Thank you for answering!

> The number of map-able GO terms in a chip package will almost always be
> a subset of the number of GO terms in the source database. This is just
> a consequence of the fact that most chips don't measure all of the
> genes.

Seems reasonable...

> I would love to improve this situation with the number of GO mappings,
> but the source I have for GO to entrez gene mappings (NCBI) has only
> provided the mappings you see here. This is also why there is no
> evidence code. I only have what NCBI gave me.

From geneontology.org it seems that the GO annotation for E. coli K12 is taken from EcoCyc (http://www.geneontology.org/gene-associations/readme/EcoCyc.README and also http://geneontology.org/GO.current.annotations.shtml). Is there a possibility of using this mapping in R?

> I could consider using blast2GO mappings instead, but
> then the nature of the data changes. And so far we have only used
> blast2GO for those organisms where it was the only option. If you have
> another good primary source for such mappings, I would love to know
> about it.

The data provided by Affymetrix? They have an Entrez Gene mapping for 4026 probes (out of the 6000 probes for MG1655).

>>               GO_BP_all  GO_CC_all  GO_MF_all
>> MG1655             9899       7023      17925
   MG1655 & eg        2072       2732       2872
>> ecoliK12.db        6394       2999        211
>> ecoliK12.db0      33526      17367       1266

This is again the table from my first message, but with an added row, "MG1655 & eg", giving the number of GO terms that have an Entrez Gene mapping (in the file provided by Affymetrix). As you can see, for BP and CC this is worse than the results obtained by constructing ecoliK12.db, but for MF it is significantly better.

But are the mappings between Entrez Gene numbers and GO terms really needed? Affymetrix provides the probe2GO mappings, and that is all I really need (the other mappings are nice, but not mandatory). However, I don't know how to differentiate between the most specific GO terms and the more general ones (which don't have to be included in the GO_BP, GO_CC and GO_MF tables). Is that information included in the NCBI database? Or is there an R function that can calculate it?

And is it possible to add information to the tables in the db database when that information is not available in the db0 database? In other words, if I want to use a mapping other than the one from NCBI (EcoCyc, for example), should I also construct a new db0 database (and how)?

> As you have discovered, not many people are using any of the
> E coli stuff yet for chip packages (in fact, you might be the 1st).
> Most people only use the organism packages org.EcK12.eg.db
> <http://www.bioconductor.org/packages/release/data/annotation/html/org.EcK12.eg.db.html>
> or org.EcSakai.eg.db
> <http://www.bioconductor.org/packages/release/data/annotation/html/org.EcSakai.eg.db.html>.
> But, since these are based on the same sources, the data representation
> will be similar.

> Finally, I would encourage you not to comment out the
> VACUUM and ANALYZE statements in AnnotationDbi when you make chip
> packages. Instead, I would recommend that you use more liberal write
> permissions so that your code can perform writes on your newly created
> DBs.

I never changed the default write permissions, and the resulting database is writable (Unix permissions) for me, so I did wonder why I got a write permission error. I don't know much about SQL, but I figured out that neither command alters the results returned; both are efficiency measures that I could fix later.

Kind regards,

Gaspard
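On the question of separating the most specific GO terms from the more general ones: a term is redundant for a probe when it is an ancestor of another term annotated to the same probe. The sketch below shows that pruning in base R on a toy parent map. The term IDs, the parents list and both helper functions are invented for illustration; with real data the ancestor sets could presumably be taken from the GO.db package (for example its GOBPANCESTOR map) instead of being computed by hand.

```r
# Toy parent map: each term lists its direct parents.
# These are placeholder IDs, not real GO terms.
parents <- list(
  "GO:A" = character(0),   # root
  "GO:B" = "GO:A",
  "GO:C" = "GO:B",
  "GO:D" = "GO:A"
)

# All ancestors of one term, found by walking the parent links upward.
ancestors_of <- function(term, parents) {
  seen <- character(0)
  todo <- parents[[term]]
  while (length(todo) > 0) {
    current <- todo[1]
    todo <- todo[-1]
    if (!(current %in% seen)) {
      seen <- c(seen, current)
      todo <- c(todo, parents[[current]])
    }
  }
  seen
}

# Keep only the terms that are not an ancestor of another term in the set.
most_specific <- function(terms, parents) {
  terms <- unique(terms)
  implied <- unique(unlist(lapply(terms, ancestors_of, parents = parents)))
  setdiff(terms, implied)
}

# Probe annotations including redundant (ancestor) terms, as in an _all table.
probe2go_all <- list(
  probe_1 = c("GO:A", "GO:B", "GO:C"),
  probe_2 = c("GO:A", "GO:D")
)
probe2go <- lapply(probe2go_all, most_specific, parents = parents)
print(probe2go)
```

A named list of this shape (probe ID to character vector of GO IDs) is also the form that topGO accepts for custom annotations through its annFUN.gene2GO annotation function, which may be a way to use the Affymetrix probe2GO mappings without rebuilding the db0 database; whether that improves the results here is untested.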
