Question

how to use select() to retrieve GO Biological Process Ancestors?

1

Entering edit mode

Guido Hooiveld ★ 3.9k

@guido-hooiveld-2020

Last seen 1 day ago

Wageningen University, Wageningen, the …

I would like to retrieve the Biological Process Ancestors of some GO IDs, but I can't get it to work using the select() method. The old mget() method still works, but AFAIK using this method is not recommended anymore (because of the handling of multiple 'hits'). Am i doing something wrong, or does it indeed not work (and is this still intentional)? Thanks! Guido

> library(GO.db)
> # which annotation info can be retrieved?
> columns(GO.db)#mmm, only 4...? 
[1] "DEFINITION" "GOID"       "ONTOLOGY"   "TERM"      
> 
> # Let's query for 2 of these 4.
> 
> # some random GOIDs
> keys <- c("GO:0003677", "GO:0003899", "GO:0006351", "GO:0009507", "GO:0032549", "GO:0031047")
> 
> # select does work, but is only able to retrieve few 'columns'?
> AnnotationDbi::select(GO.db, keys=keys, keytype="GOID", columns=c("TERM","ONTOLOGY") )
'select()' returned 1:1 mapping between keys and columns
        GOID                                       TERM ONTOLOGY
1 GO:0003677                                DNA binding       MF
2 GO:0003899 DNA-directed 5'-3' RNA polymerase activity       MF
3 GO:0006351               transcription, DNA-templated       BP
4 GO:0009507                                chloroplast       CC
5 GO:0032549                     ribonucleoside binding       MF
6 GO:0031047                      gene silencing by RNA       BP
> 
> # ... but this doesn't work!
> AnnotationDbi::select(GO.db, keys=keys, keytype="GOID", columns=c("GOMFANCESTOR") )
Error in .testForValidCols(x, cols) : 
  Invalid columns: GOMFANCESTOR. Please use the columns method to see a listing of valid arguments.
> 
> # Old way still does!
> mget(keys, GOMFANCESTOR, ifnotfound=NA)
$`GO:0003677`
[1] "GO:0003674" "GO:0003676" "GO:0005488" "GO:0097159" "GO:1901363"
[6] "all"       

$`GO:0003899`
[1] "GO:0003674" "GO:0003824" "GO:0016740" "GO:0016772" "GO:0016779"
[6] "GO:0034062" "GO:0097747" "GO:0140098" "all"       

$`GO:0006351`
[1] NA

$`GO:0009507`
[1] NA

$`GO:0032549`
[1] "GO:0001882" "GO:0003674" "GO:0005488" "GO:0036094" "GO:0097159"
[6] "GO:0097367" "GO:1901363" "all"       

$`GO:0031047`
[1] NA

> 
> 



> sessionInfo()
R version 4.0.3 Patched (2020-12-21 r79668)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] clusterProfiler_3.18.0 GO.db_3.12.1           AnnotationDbi_1.52.0  
[4] IRanges_2.24.1         S4Vectors_0.28.1       Biobase_2.50.0        
[7] BiocGenerics_0.36.0

GO.db AnnotationDbi • 1.9k views

ADD COMMENT • link updated 3.2 years ago by Martin Morgan 25k • written 3.2 years ago by Guido Hooiveld ★ 3.9k

score 2 · Answer 1 · 2021-01-27

I know this isn't what you want, or even 'best' or even 'good' practice, but it's kind of empowering to realize that the GO.db (and other annotation) package is a SQLite database with tables

library(dplyr)
con <- GO.db::GO_dbconn()

The tables are

> dbplyr::src_dbi(con)
src:  sqlite 3.34.1 [/Users/ma38727/Library/R/4.1/Bioc/3.13/library/GO.db/extdata/GO.sqlite]
tbls: go_bp_offspring, go_bp_parents, go_cc_offspring, go_cc_parents,
  go_mf_offspring, go_mf_parents, go_obsolete, go_ontology, go_synonym,
  go_term, map_counts, map_metadata, metadata, sqlite_stat1

A table has information from the GO hierarchy, as well as _id columns that represent unique keys in the go_term table. Here are two tables

go_term <- tbl(con, "go_term")
go_mf_offspring <- tbl(con, "go_mf_offspring")

The go_term table contains an internal id column, and then more-or-less familiar information

> go_term
# Source:   table<go_term> [?? x 5]
# Database: sqlite 3.34.1
#   [/Users/ma38727/Library/R/4.1/Bioc/3.13/library/GO.db/extdata/GO.sqlite]
   `_id` go_id   term                   ontology definition                     
   <int> <chr>   <chr>                  <chr>    <chr>                          
 1     2 GO:000… mitochondrion inherit… BP       The distribution of mitochondr…
 2     3 GO:000… mitochondrial genome … BP       The maintenance of the structu…
 3     4 GO:000… reproduction           BP       The production of new individu…
 4     6 GO:000… high-affinity zinc tr… MF       Enables the transfer of zinc i…
 5     7 GO:000… low-affinity zinc ion… MF       Enables the transfer of a solu…
 6     9 GO:000… alpha-1,6-mannosyltra… MF       Catalysis of the transfer of a…
 7    10 GO:000… trans-hexaprenyltrans… MF       Catalysis of the reaction: all…
 8    11 GO:000… vacuole inheritance    BP       The distribution of vacuoles i…
 9    12 GO:000… single strand break r… BP       The repair of single strand br…
10    13 GO:000… single-stranded DNA e… MF       Catalysis of the hydrolysis of…
# … with more rows

The go_mf_offspring table contains the relationship between a 'parent' _id, and the ids of all offspring (children and more distant relations) of the parent id.

> go_mf_offspring
# Source:   table<go_mf_offspring> [?? x 2]
# Database: sqlite 3.34.1
#   [/Users/ma38727/Library/R/4.1/Bioc/3.13/library/GO.db/extdata/GO.sqlite]
   `_id` `_offspring_id`
   <int>           <int>
 1     9           16486
 2     9           28973
 3     9           29066
 4    13           45591
 5    13           45593
 6    24            3286
 7    24           29067
 8    24           29075
 9    27               9
10    27              24
# … with more rows

So start with (a vector of) keys...

keys <- "GO:0003677"

...map to internal id

key2id <-
    go_term %>%
    filter(go_id %in% keys) %>%
    select(go_id, `_id`)

You are interested in the ancestors of ids, and the ancestors can be found by joining your ids to the offspring id in the go_mf_offspring table

id2ancestor <-
    left_join(key2id, go_mf_offspring, by = c(`_id` = "_offspring_id"))

Finally, map from ancestor id to go id, and clean up

ancestor2go_id <-
    left_join(id2ancestor, go_term, by = c(`_id.y` = "_id")) %>%
    rename(go_id = go_id.x, ancestor_go_id = go_id.y) %>%
    select(-c(`_id.x`, `_id.y`, ontology))

Here's a function...

mf_ancestors <-
    function(keys)
{
    con <- GO.db::GO_dbconn()
    go_term <- tbl(con, "go_term")
    go_mf_offspring <- tbl(con, "go_mf_offspring")

    ## keys to internal id
    keys2id <-
        go_term %>%
        filter(go_id %in% keys) %>%
        select(go_id, `_id`)

    ## internal id to ancestors
    id2ancestor <-
        left_join(keys2id, go_mf_offspring, by = c(`_id` = "_offspring_id"))

    ## ancestor id to go_id
    keys2ancestor <-
        left_join(id2ancestor, go_term, by = c(`_id.y` = "_id"))

    ## clean up and return
    keys2ancestor %>%
        rename(go_id = go_id.x, ancestor = go_id.y) %>%
        select(-c(`_id.x`, `_id.y`, ontology)) %>%
        collect()
}

... in action

> mf_ancestors(keys)

# A tibble: 6 x 4
  go_id    ancestor  term              definition                               
  <chr>    <chr>     <chr>             <chr>                                    
1 GO:0003… GO:00036… molecular_functi… A molecular process that can be carried …
2 GO:0003… GO:00036… nucleic acid bin… Interacting selectively and non-covalent…
3 GO:0003… GO:00054… binding           The selective, non-covalent, often stoic…
4 GO:0003… GO:00971… organic cyclic c… Interacting selectively and non-covalent…
5 GO:0003… GO:19013… heterocyclic com… NA                                       
6 GO:0003… all       all               NA

Here are your random keys

> keys <- c("GO:0003677", "GO:0003899", "GO:0006351", "GO:0009507", "GO:0032549", "GO:0031047")

> mf_ancestors(keys)
# A tibble: 26 x 4
   go_id   ancestor  term                      definition                       
   <chr>   <chr>     <chr>                     <chr>                            
 1 GO:000… GO:00036… molecular_function        A molecular process that can be …
 2 GO:000… GO:00036… nucleic acid binding      Interacting selectively and non-…
 3 GO:000… GO:00054… binding                   The selective, non-covalent, oft…
 4 GO:000… GO:00971… organic cyclic compound … Interacting selectively and non-…
 5 GO:000… GO:19013… heterocyclic compound bi… NA                               
 6 GO:000… all       all                       NA                               
 7 GO:000… GO:00036… molecular_function        A molecular process that can be …
 8 GO:000… GO:00038… catalytic activity        Catalysis of a biochemical react…
 9 GO:000… GO:00167… transferase activity      Catalysis of the transfer of a g…
10 GO:000… GO:00167… transferase activity, tr… Catalysis of the transfer of a p…
# … with 16 more rows

I think working with tibbles has distinct advantageous, but if a list-of-ancestors is desired, one could

> mf_ancestors(keys) %>% with(split(ancestor, go_id)[keys])
$`GO:0003677`
[1] "GO:0003674" "GO:0003676" "GO:0005488" "GO:0097159" "GO:1901363"
[6] "all"       

$`GO:0003899`
[1] "GO:0003674" "GO:0003824" "GO:0016740" "GO:0016772" "GO:0016779"
[6] "GO:0034062" "GO:0097747" "GO:0140098" "all"       

$`GO:0006351`
[1] NA

$`GO:0009507`
[1] NA

$`GO:0032549`
[1] "GO:0001882" "GO:0003674" "GO:0005488" "GO:0036094" "GO:0097159"
[6] "GO:0097367" "GO:1901363" "all"       

$`GO:0031047`
[1] NA