building a new annotation
Chirag Patel
Hello,

I would like to build a new annotation using data from the CTD (http://ctd.mdibl.org).

The data is in an SQLite DB whose main table has the schema: entrez_gene_id, chemical_id, relation_id, and pubmed_citation_id.

Relation_id is an internal id I use to manage relations between the chemicals and genes. Chemical_id is an id used by the CTD to identify chemicals.

How may I best do this using the tools available on Bioconductor? I was thinking of using AnnBuilder or AnnotationDbi, but am unsure if this is the right way to go; this is my first time building a package or an annotation.

Any help would be much appreciated,

Chirag
Marc Carlson
Hi Chirag,

You don't give us a lot of details about the schema of your database. You just say there is a main table. That suggests there are other tables, but how do they relate to the main table?

How tables relate to each other is very important and is what we've formalized with the notion of L2Rchain (left-to-right chain) and L2Rlink objects in AnnotationDbi. The idea is that any map we define in our .db packages can be described with an (L2R) chain of (L2R) links. This is a high-level description of the map. The advantages of such a description are:

  (a) Defining a map doesn't require you to write any SQL statement. The SQL code is automatically generated from the high-level description when the user queries the map (with mget(), keys(), get(), etc...).

  (b) It's easy to define new maps.

  (c) Some operations/transformations on maps are easier to do at the high level (e.g. adding/modifying a filter, plugging maps together, etc...; unfortunately those operations are not available yet).

So what's an L2R chain? Here is an imaginary database:

    table1    table2    table3
    ------    ------    ------
    col1a     col2a     col3a
    col1b     col2b     col3b
    col1c     col2c
              col2d

A map can be seen as a path that goes from any column in the db (e.g. table1.col1c) to any other column in the db (e.g. table2.col2b). The L2R chain describes the path that must be followed to go from table1.col1c (the leftmost col of the map) to table2.col2b (the rightmost col of the map). This path is described with 1 or more L2R links. For example, mapA could be described with 3 links:

    1st link: table1.col1c -> table1.col1a
    2nd link: table3.col3a -> table3.col3b
    3rd link: table2.col2d -> table2.col2b

Note that the left and right columns of a given link always belong to the same table.

The simplest kind of map is one that maps 2 columns of the same table; it is described with just 1 link. To define this kind of map, just use Marc's createSimpleBimap() function.

But what happens between links?
What does it mean that link 1 [table1.col1c -> table1.col1a] is followed by link 2 [table3.col3a -> table3.col3b]? It means that columns table1.col1a and table3.col3a are in relation, i.e. that the values they contain are of the same type and refer to the same entities. Most of the time, this will appear explicitly in the SQL schema: there will be a foreign key between the 2 columns, but not always. Also, most of the time, the 2 columns in relation will have the same name, but not always. In the end it's up to you to decide whether it makes sense or not to put 2 columns in relation.

When it's time to extract data from the map, each relation between 2 links will be translated into an SQL join. For example, when extracting all the data from mapA (with 'toTable()' or 'as.list()'), an SQL statement will be generated that will more or less look like this:

    SELECT table1.col1c,table2.col2b
      FROM table1 INNER JOIN table3 ON table1.col1a=table3.col3a
                  INNER JOIN table2 ON table3.col3b=table2.col2d;

(In practice, things are a little bit more complicated. To see exactly what's generated, turn on SQL debugging mode with AnnotationDbi:::debugSQL())

If you look at the hgu95av2ENZYME map in hgu95av2.db:

    > str(hgu95av2ENZYME)
    Formal class 'AnnDbBimap' [package "AnnotationDbi"] with 8 slots
      ..@ L2Rchain  :List of 2
      .. ..$ :Formal class 'L2Rlink' [package "AnnotationDbi"] with 8 slots
      .. .. .. ..@ tablename   : chr "probes"
      .. .. .. ..@ Lcolname    : chr "probe_id"
      .. .. .. ..@ tagname     : chr NA
      .. .. .. ..@ Rcolname    : chr "_id"
      .. .. .. ..@ Rattribnames: chr(0)
      .. .. .. ..@ Rattrib_join: chr NA
      .. .. .. ..@ filter      : chr "1"
      .. .. .. ..@ altDB       : chr(0)
      .. ..$ :Formal class 'L2Rlink' [package "AnnotationDbi"] with 8 slots
      .. .. .. ..@ tablename   : chr "ec"
      .. .. .. ..@ Lcolname    : chr "_id"
      .. .. .. ..@ tagname     : chr NA
      .. .. .. ..@ Rcolname    : chr "ec_number"
      .. .. .. ..@ Rattribnames: chr(0)
      .. .. .. ..@ Rattrib_join: chr NA
      .. .. .. ..@ filter      : chr "1"
      .. .. .. ..@ altDB       : chr(0)
      ..@ direction : int 1
      ..@ Lkeys     : chr NA
      ..@ Rkeys     : chr NA
      ..@ ifnotfound: list()
      ..@ datacache :<environment: 0x2413308>
      ..@ objName   : chr "ENZYME"
      ..@ objTarget : chr "chip hgu95av2"

You can see it has 2 links:

    [probes.probe_id -> probes._id]
    [ec._id -> ec.ec_number]

The probes._id and ec._id columns both contain internal gene ids, i.e. arbitrary integers that we use within the scope of the hgu95av2.db package to uniquely refer to genes (the mapping between this internal id and the real Entrez id is stored in the 'genes' table). So the 2 columns are naturally in relation.

Most maps in hgu95av2.db are made of two L2R links. But hgu95av2ACCNUM, for example, is made of one link only. Some maps in GO.db are made of 3 links where the leftmost and rightmost columns belong to the same table (but the path between them goes through another table).

Look at the R/createAnnObjs.*_DB.R files in AnnotationDbi: they contain the L2Rchain/L2Rlink description of all the predefined maps that you find in our .db packages. For example, createAnnObjs.HUMANCHIP_DB.R contains the definition of all the maps found in hgu95av2.db (and any other .db package based on the HUMANCHIP_DB schema; use 'dbmeta(hgu95av2_dbconn(), "DBSCHEMA")' to get the name of the underlying db schema). Those map definitions are stored in the HUMANCHIP_DB_AnnDbBimap_seeds object (list of lists of etc... there are many nested levels).

You'll need to reproduce something like this in your own annotation package and then call AnnotationDbi:::createAnnDbBimaps() on it to create the maps. Look at the code for the details. There are a lot of details to take care of but I can't cover them all here.

Hope this gets you started. Let us know if you need further help.

Cheers,
H.
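The join described for mapA can be tried out on a toy database. The following is only a minimal sketch using DBI and RSQLite directly (it does not go through AnnotationDbi's L2Rchain machinery); the table and column names are those of the imaginary schema in this message, and all the values are made up for illustration.

```r
library(DBI)

# In-memory SQLite database reproducing the imaginary three-table schema.
# All values below are invented for illustration only.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable(con, "table1", data.frame(
  col1a = c(1L, 2L, 3L),
  col1b = c("x", "y", "z"),
  col1c = c("left1", "left2", "left3"),
  stringsAsFactors = FALSE
))
dbWriteTable(con, "table3", data.frame(
  col3a = c(1L, 2L),
  col3b = c(10L, 20L)
))
dbWriteTable(con, "table2", data.frame(
  col2a = c("p", "q"),
  col2b = c("right10", "right20"),
  col2c = c("r", "s"),
  col2d = c(10L, 20L),
  stringsAsFactors = FALSE
))

# One join per relation between consecutive links of mapA:
#   table1.col1a <-> table3.col3a, then table3.col3b <-> table2.col2d
sql <- "
  SELECT table1.col1c AS col1c, table2.col2b AS col2b
  FROM table1
  INNER JOIN table3 ON table1.col1a = table3.col3a
  INNER JOIN table2 ON table3.col3b = table2.col2d
  ORDER BY table1.col1c
"
mapA <- dbGetQuery(con, sql)
dbDisconnect(con)
```

In this toy data the row with col1a = 3 has no partner in table3, so the inner join drops it; the result pairs only the leftmost and rightmost columns of the path.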
Marc Carlson wrote:
> Hi Chirag,
>
> createSimpleBimap is really meant for the case where someone is using a
> custom annotation package that they have generated using SQLForge (you
> don't want to do that), and they have added a single table which
> contains all the information that they wish to represent. In this very
> simple case, createSimpleBimap() will add a mapping to your package.
> But otherwise you will probably want to have a look at (as an example)
> the createAnnObjs.HUMANCHIP_DB.R in the AnnotationDbi package, and also
> at the zzz.R inside the hgu95av2.db package for an example of how these
> mappings can be set up. If you look at these examples you will see some
> L2Rchains being used to define the set of mappings needed for a package.
>
> Please keep the conversation "on list" so that others can benefit from
> your questions. And while we are on that topic, this conversation would
> probably be a better fit on the bioc-devel mailing list than here,
> because you are really talking about defining a new set of interfaces
> for interacting with a completely different SQLite database schema than
> anything else we support. And actually, you really might not need to
> make a set of mappings at all. You might instead just want to write
> some simple functions to retrieve pertinent data from the database. I
> still don't know which of the data in this database you want to use or
> what you want to do with it, so it's difficult for me to really advise
> you on what is more appropriate at this time.
>
> Marc
>
> Chirag Patel wrote:
>> Marc,
>> Thanks so much for your response... AnnotationDbi may be the way to go
>> for me.
>>
>> I have a couple more questions. I am working through the
>> vignette, and I am having trouble understanding how the objects are
>> mapped to the underlying db. How exactly do we create these objects?
>> I am guessing that I should start with 'createSimpleBimap'.
>>
>> For example, if we use the example of the affy annotation db,
>> "hgu95av2.db", we have the Bimap objects hgu95av2ACCNUM,
>> hgu95av2ALIAS2PROBE, etc...
>>
>> How do we specify these objects?
>>
>> And what is the 'L2Rchain' structure you talk about below?
>>
>> Thanks,
>>
>> Chirag
>>
>> On Mar 12, 2009, at 10:38 AM, Marc Carlson wrote:
>>> Hi Chirag, [...]

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
Dear All,

before seeing Hervé's excellent tutorial, I put together a page documenting creating an annotation package with a new database schema. I am sure it has some flaws, but hopefully not very serious ones. I did it mostly for myself, but maybe it is useful for others, especially together with Hervé's explanation.

It is here:
http://www2.unil.ch/cbg/index.php?title=Building_BioConductor_Annotation_Packages

Best,
Gabor

On Fri, Mar 13, 2009 at 4:47 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi Chirag,
[...]

--
Gabor Csardi <gabor.csardi at unil.ch> UNIL DGM
Hi Gabor,

Thanks for writing this document and sharing it. It's amazing that you've figured out how to make your own AnnDbBimap-based annotation packages using an entirely new db schema. You've also managed to define maps with a lot of right attributes (miRNATargetAnnDbBimap and miRNAAnnDbBimap classes) and to write appropriate "as.list" methods for them. Kudos for that!

Let me just mention that yes, you can make multiple inserts with a single SQL statement with RSQLite by using a "prepared query". See this post from Seth Falcon on the R-sig-DB mailing list for how to do this:

https://stat.ethz.ch/pipermail/r-sig-db/2006q4/000234.html

This will make the "Put the data into the tables" step much, much faster!

Cheers,
H.

Gábor Csárdi wrote:
> Dear All, [...]
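For readers who want to see the idea of a single parameterised INSERT in code, here is a minimal sketch using the parameter binding offered by the current DBI and RSQLite packages. It is not taken from the linked post, whose exact calls may differ; the table name gene_chemical and all values are assumptions made up for illustration.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical two-column table; the name and contents are illustrative.
dbExecute(con, "CREATE TABLE gene_chemical (
  entrez_gene_id INTEGER,
  chemical_id    TEXT
)")

rows <- data.frame(
  entrez_gene_id = c(1L, 2L, 3L),
  chemical_id    = c("chemA", "chemB", "chemC"),
  stringsAsFactors = FALSE
)

# One parameterised INSERT statement, executed once per row of 'rows'.
# Wrapping it in a transaction lets SQLite commit only once at the end.
dbBegin(con)
dbExecute(
  con,
  "INSERT INTO gene_chemical (entrez_gene_id, chemical_id) VALUES (?, ?)",
  params = unname(as.list(rows))
)
dbCommit(con)

n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM gene_chemical")$n
dbDisconnect(con)
```

The point of the design is that the SQL text is parsed once and only the bound values change, instead of building and sending one INSERT string per row.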
Marc Carlson
Hi Chirag,

If you are building this on top of a custom database that you already have in hand, then you cannot use SQLForge, because that will try to make a customized database for you. And AnnBuilder is gone now (and would not have helped you here anyway). Instead, you might want to look closely at the code in AnnotationDbi, which defines several types of databases along with the mappings that represent the underlying DB data in R using an L2Rchain structure. We plan to make these structures more accessible outside of AnnotationDbi in the future.

Alternatively (depending completely upon what kind of access you want to provide to your users), you could also pretty easily just write some simple accessors to talk to this database. Direct access to SQLite databases is pretty straightforward from R using the RSQLite and DBI packages. The AnnotationDbi vignette has some examples of this direct style of access; you can find it here:

http://www.bioconductor.org/packages/devel/bioc/html/AnnotationDbi.html

If you have further questions please let me know,

Marc
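As a rough illustration of the "simple accessors" route, the sketch below queries a table shaped like the one in the original question with DBI and RSQLite. The column names come from the question; the table name gene_chemical, the function name chemicalsForGenes(), and every value are hypothetical, and a real CTD database would need its own connection and table names.

```r
library(DBI)

# Hypothetical accessor: chemicals and supporting PubMed ids for a set of genes.
# 'gene_chemical' is an assumed table name, not one given in the thread.
chemicalsForGenes <- function(con, entrez_ids) {
  stopifnot(length(entrez_ids) > 0)
  # One '?' placeholder per requested gene id, bound as query parameters.
  placeholders <- paste(rep("?", length(entrez_ids)), collapse = ", ")
  sql <- paste0(
    "SELECT entrez_gene_id, chemical_id, pubmed_citation_id ",
    "FROM gene_chemical ",
    "WHERE entrez_gene_id IN (", placeholders, ") ",
    "ORDER BY entrez_gene_id, chemical_id"
  )
  dbGetQuery(con, sql, params = unname(as.list(entrez_ids)))
}

# Toy in-memory database with made-up rows, standing in for the real file.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "gene_chemical", data.frame(
  entrez_gene_id     = c(1L, 1L, 2L, 3L),
  chemical_id        = c("chemA", "chemB", "chemA", "chemC"),
  relation_id        = 1:4,
  pubmed_citation_id = c(111L, 222L, 333L, 444L),
  stringsAsFactors = FALSE
))

hits <- chemicalsForGenes(con, c(1L, 2L))
dbDisconnect(con)
```

A handful of functions like this one may be all a user-facing package needs if Bimap-style objects (mget(), toTable(), etc.) are not a requirement.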