what's the difference between this two AnnotationDbi::select type in GO select
1
0
Entering edit mode
@shangguandong1996-21805
Last seen 4 weeks ago
China

Hi,

Acutally, I have two question.

  1. I want to extract a GO term realted gene, and I find two solution:
> AnnotationDbi::select(org.At.tair.db, 
+                       keys = "GO:0048449", 
+                       keytype = "GO", columns = "TAIR") %>% 
+   as_tibble()
'select()' returned 1:many mapping between keys and columns
# A tibble: 74 x 4
   GO         EVIDENCE ONTOLOGY TAIR     
   <chr>      <chr>    <chr>    <chr>    
 1 GO:0048449 RCA      BP       AT1G08130
 2 GO:0048449 RCA      BP       AT1G09450
 3 GO:0048449 RCA      BP       AT1G13030
 4 GO:0048449 RCA      BP       AT1G13980
 5 GO:0048449 RCA      BP       AT1G20410
 6 GO:0048449 RCA      BP       AT1G21690
 7 GO:0048449 RCA      BP       AT1G21880
 8 GO:0048449 RCA      BP       AT1G23000
 9 GO:0048449 RCA      BP       AT1G26190
10 GO:0048449 RCA      BP       AT1G33410
# … with 64 more rows
> AnnotationDbi::select(org.At.tair.db, 
+                       keys = "GO:0048449", 
+                       keytype = "GOALL", columns = "TAIR") %>% 
+   as_tibble()
'select()' returned 1:many mapping between keys and columns
# A tibble: 183 x 4
   GOALL      EVIDENCEALL ONTOLOGYALL TAIR     
   <chr>      <chr>       <chr>       <chr>    
 1 GO:0048449 RCA         BP          AT1G01370
 2 GO:0048449 RCA         BP          AT1G02800
 3 GO:0048449 RCA         BP          AT1G04050
 4 GO:0048449 RCA         BP          AT1G04760
 5 GO:0048449 RCA         BP          AT1G05440
 6 GO:0048449 RCA         BP          AT1G06420
 7 GO:0048449 RCA         BP          AT1G08130
 8 GO:0048449 RCA         BP          AT1G09450
 9 GO:0048449 RCA         BP          AT1G10980
10 GO:0048449 RCA         BP          AT1G13030
# … with 173 more rows

I am wondering why GOALL gives more genes ? I have seen the definition of GOALL, it says that

GOALL: GO Identifiers (includes less specific terms)

  1. The second question is that I use this two select, and it produce a very different result. I am wondering whether some one can give me some explantion. I am wondering whether the first select give me the parent of GO:0000003 and the second give me the child of GO:0000003.
> AnnotationDbi::select(org.At.tair.db, 
+                       keys = "GO:0000003", 
+                       keytype = "GO", columns = "GOALL") %>% 
+   as_tibble()
'select()' returned 1:many mapping between keys and columns
# A tibble: 2,055 x 6
   GO         EVIDENCE ONTOLOGY GOALL      EVIDENCEALL ONTOLOGYALL
   <chr>      <chr>    <chr>    <chr>      <chr>       <chr>      
 1 GO:0000003 RCA      BP       GO:0000003 IMP         BP         
 2 GO:0000003 RCA      BP       GO:0000003 RCA         BP         
 3 GO:0000003 RCA      BP       GO:0000280 RCA         BP         
 4 GO:0000003 RCA      BP       GO:0000375 RCA         BP         
 5 GO:0000003 RCA      BP       GO:0000377 RCA         BP         
 6 GO:0000003 RCA      BP       GO:0000398 RCA         BP         
 7 GO:0000003 RCA      BP       GO:0000902 RCA         BP         
 8 GO:0000003 RCA      BP       GO:0000904 RCA         BP         
 9 GO:0000003 RCA      BP       GO:0002252 RCA         BP         
10 GO:0000003 RCA      BP       GO:0002376 RCA         BP         
# … with 2,045 more rows
> AnnotationDbi::select(org.At.tair.db, 
+                       keys = "GO:0000003", 
+                       keytype = "GOALL", columns = "GO") %>% 
+   as_tibble()
'select()' returned 1:many mapping between keys and columns
# A tibble: 8,043 x 6
   GOALL      EVIDENCEALL ONTOLOGYALL GO         EVIDENCE ONTOLOGY
   <chr>      <chr>       <chr>       <chr>      <chr>    <chr>   
 1 GO:0000003 IMP         BP          GO:0003700 ISS      MF      
 2 GO:0000003 IMP         BP          GO:0005634 ISM      CC      
 3 GO:0000003 IMP         BP          GO:0006355 TAS      BP      
 4 GO:0000003 IMP         BP          GO:0009908 IMP      BP      
 5 GO:0000003 IMP         BP          GO:0048366 IMP      BP      
 6 GO:0000003 IMP         BP          GO:0000226 RCA      BP      
 7 GO:0000003 IMP         BP          GO:0000278 RCA      BP      
 8 GO:0000003 IMP         BP          GO:0000911 RCA      BP      
 9 GO:0000003 IMP         BP          GO:0003725 IDA      MF      
10 GO:0000003 IMP         BP          GO:0004525 ISS      MF      
# … with 8,033 more rows

Best wishes

Guandong Shang

GO.db AnnotationDbi • 226 views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 7 hours ago
United States

The difference between 'GO' and 'GOALL' is that the former gives you all the GO terms that are directly appended to a given gene (TAIR ID), and the latter gives you the direct GO terms as well as all of their offspring. Normally those two columns are meant to do the forward mapping (TAIR ID to GO ID), primarily for doing GO hypergeometric testing. If you do the reverse it's sort of weird. For GO -> TAIR, it's probably OK - you are asking for all the TAIR IDs that are directly mapped to a given GO. But if you do GOALL -> TAIR, you are asking for the TAIR IDs that map to a given GO ID and all of that GO ID's offspring. Which might be what you want as well.

Your last two select calls are problematic though, because you are mapping GO IDs through the intervening TAIR IDs, which seems to do weird things. I should also point out that I get completely different results than you do, which probably indicates you are using old R/Bioconductor versions. When posting questions you should always remember to include the output from running sessionInfo() so we know what you are using.

Anyway, GO:0000003 is 'reproduction', and is a child of 'biological process', so the only parent term will be 'biological process'. We can query the GO.db package directly to show that (or go to the AmiGO site, but that's boring - let's have fun!).

> library(GO.db)
> con <- GO_dbconn()
> dbListTables(con)
 [1] "go_bp_offspring" "go_bp_parents"   "go_cc_offspring" "go_cc_parents"  
 [5] "go_mf_offspring" "go_mf_parents"   "go_obsolete"     "go_ontology"    
 [9] "go_synonym"      "go_term"         "map_counts"      "map_metadata"   
[13] "metadata"        "sqlite_stat1"   

## we need the _id for GO:0000003

> dbGetQuery(con, "select _id, go_id from go_term where go_id='GO:0000003';")
  _id      go_id         
1   4 GO:0000003 

> dbGetQuery(con, "select go_id, term from go_term where _id in (select _parent_id from go_bp_parents where go_bp_parents._id='4');")
       go_id               term
1 GO:0008150 biological_process

So there is only one parent term for 'reproduction'. And all the offspring can be accessed directly as well.

> z <- dbGetQuery(con, "select go_id, term from go_term where _id in (select _offspring_id from go_bp_offspring where go_bp_offspring._id='4');")

## and your query that 'might' get the offspring

firstcall <- select(org.At.tair.db, "GO:0000003", "GOALL","GO")

> head(z)
       go_id                                                    term
1 GO:0000212                            meiotic spindle organization
2 GO:0000321 re-entry into mitotic cell cycle after pheromone arrest
3 GO:0000705                                    achiasmate meiosis I
4 GO:0000706              meiotic DNA double-strand break processing
5 GO:0000707                        meiotic DNA recombinase assembly
6 GO:0000708                                 meiotic strand invasion
> dim(firstcall)
[1] 814   6
> dim(z)
[1] 1276    2

## what's the overlap?

> sum(z[,1] %in% firstcall$GOALL)
[1] 22

So it looks like doing the GO -> GOALL mapping with the org.At.tair.db package is probably not doing what you want to do.

0
Entering edit mode

I hit enter too fast, so if you are following this via email you should come to the support site to see the entire answer.

ADD REPLY
0
Entering edit mode

Thanks for yor reply, James.

But I am confused about this sentence:

The difference between 'GO' and 'GOALL' is that the former gives you all the GO terms that are directly appended to a given gene (TAIR ID), and the latter gives you the direct GO terms as well as all of their offspring.

In my opinion, it should be the Ancestors instead of offspring. After all, the definition is GOALL: GO Identifiers (includes less specific terms).

And I also find a weird thing about GOALL in select. According to the definition, the GOALL includes less specific terms, so I believe it should the include direct GO and its ancestors GO. But I find select function will drop those GO without ancestors . Here is my three result to confirm this things. All I want to find is the related GO(direct and ancestors) of AT2G17950.

First, I will the use GOBPANCESTOR in GO.db, according to its definition:

This data set describes associations between GO Biological Process (BP) terms and their ancestor BP terms, based on the directed acyclic graph (DAG) defined by the Gene Ontology Consortium. The format is an R object mapping the GO BP terms to all ancestor terms, where an ancestor term is a more general GO term that precedes the given GO term in the DAG (in other words, the parents, and all their parents, etc.).

# First I use use TAIR ID -> GO ID to get direct GO
library(GO.db)
library(dplyr)
library(org.At.tair.db)

AnnotationDbi::select(org.At.tair.db, 
                      keys = "AT2G17950",
                      keytype = "TAIR", columns = "GO") %>%
  distinct(GO, .keep_all = TRUE) %>% 
  select(2,1) -> data

> data
           GO      TAIR
1  GO:0000976 AT2G17950
2  GO:0003700 AT2G17950
3  GO:0005515 AT2G17950
4  GO:0005634 AT2G17950
5  GO:0006355 AT2G17950
6  GO:0010492 AT2G17950
7  GO:0019827 AT2G17950
8  GO:0045471 AT2G17950
9  GO:0048653 AT2G17950
10 GO:0080166 AT2G17950
11 GO:0090506 AT2G17950
> nrow(data)
[1] 11

# Then I use GOBPANCESTOR to get ancestors BP
ans <- unlist(mget(data$GO, GOBPANCESTOR, ifnotfound=NA))
ans <- ans[ !is.na(ans) ]
ans <- ans[ans != "all"]
# merge direct and ancestors GO
ans <- c(ans, data$GO)
ans <- unique(ans)

> length(ans)
[1] 100

Second I use GOALL in select to get AT2G17950 all realted GO

# It still be TAIR ID -> GO
AnnotationDbi::select(org.At.tair.db, 
                      keys = "AT2G17950",
                      keytype = "TAIR", 
                      columns = "GOALL") %>% 
  filter(ONTOLOGYALL == "BP") %>% 
  distinct(GOALL, .keep_all = TRUE) %>% 
  as_tibble(.) -> ans_other

> nrow(ans_other)
[1] 96

Third, I will use GOALL -> TAIR ID

# First I will extract all GO term
goterms <- AnnotationDbi::Ontology(GO.db::GOTERM)
goterms <- goterms[goterms == "BP"]

# Then I use mapIds to extract related gene
goterms <- AnnotationDbi::Ontology(GO.db::GOTERM)
goterms <- goterms[goterms == "BP"]
go2gene <- suppressMessages(
  AnnotationDbi::mapIds(org.At.tair.db,
                        keys=names(goterms),
                        column="TAIR",
                        keytype="GOALL", multiVals='list')
)
goAnno <- stack(go2gene)
colnames(goAnno) <- c("TAIR", "GOALL")
goAnno <- unique(goAnno[!is.na(goAnno[,1]), ])

> nrow(subset(goAnno, TAIR == "AT2G17950"))
[1] 96

As you can see the second and third methods result are same:

# Third and Second
> intersect(subset(goAnno, TAIR == "AT2G17950")$GOALL,
+           ans_other$GOALL) %>% 
+   length()
[1] 96

But the number of First is 4 more than second or third

> sum(ans_other$GOALL %in% ans)
[1] 96
> ans[!ans %in% ans_other$GOALL]
[1] "GO:0000976" "GO:0003700" "GO:0005515" "GO:0005634"

And you can find these 4 GO do not have ancestors. But this 4 GO is the direct GO of AT2G17950

> mget(ans[!ans %in% ans_other$GOALL], GOBPANCESTOR, ifnotfound = NA)
$`GO:0000976`
[1] NA

$`GO:0003700`
[1] NA

$`GO:0005515`
[1] NA

$`GO:0005634`
[1] NA

# the data is the origin direct GO in the first method
> ans[!ans %in% ans_other$GOALL] %in% data$GO
[1] TRUE TRUE TRUE TRUE

By the way, I am sorry I do not post the session info. In this reply, I convert my Biocondutcot version into 3.13. Here is the sessionInfo

> session_info()
- Session info ------------------------------------------------------------------
 setting  value                         
 version  R version 4.1.0 (2021-05-18)  
 os       Windows 10 x64                
 system   x86_64, mingw32               
 ui       RStudio                       
 language (EN)                          
 collate  Chinese (Simplified)_China.936
 ctype    Chinese (Simplified)_China.936
 tz       Asia/Taipei                   
 date     2021-09-01                    

- Packages ----------------------------------------------------------------------
 package          * version  date       lib source        
 AnnotationDbi    * 1.54.1   2021-06-08 [1] Bioconductor  
 assertthat         0.2.1    2019-03-21 [1] CRAN (R 4.1.0)
 Biobase          * 2.52.0   2021-05-19 [1] Bioconductor  
 BiocGenerics     * 0.38.0   2021-05-19 [1] Bioconductor  
 Biostrings         2.60.2   2021-08-05 [1] Bioconductor  
 bit                4.0.4    2020-08-04 [1] CRAN (R 4.1.0)
 bit64              4.0.5    2020-08-30 [1] CRAN (R 4.1.0)
 bitops             1.0-7    2021-04-24 [1] CRAN (R 4.1.0)
 blob               1.2.2    2021-07-23 [1] CRAN (R 4.1.0)
 cachem             1.0.5    2021-05-15 [1] CRAN (R 4.1.0)
 callr              3.7.0    2021-04-20 [1] CRAN (R 4.1.0)
 cli                3.0.1    2021-07-17 [1] CRAN (R 4.1.0)
 crayon             1.4.1    2021-02-08 [1] CRAN (R 4.1.0)
 DBI                1.1.1    2021-01-15 [1] CRAN (R 4.1.0)
 desc               1.3.0    2021-03-05 [1] CRAN (R 4.1.0)
 devtools         * 2.4.1    2021-05-05 [1] CRAN (R 4.1.0)
 dplyr            * 1.0.6    2021-05-05 [1] CRAN (R 4.1.0)
 ellipsis           0.3.2    2021-04-29 [1] CRAN (R 4.1.0)
 fansi              0.5.0    2021-05-25 [1] CRAN (R 4.1.0)
 fastmap            1.1.0    2021-01-25 [1] CRAN (R 4.1.0)
 fs                 1.5.0    2020-07-31 [1] CRAN (R 4.1.0)
 generics           0.1.0    2020-10-31 [1] CRAN (R 4.1.0)
 GenomeInfoDb       1.28.2   2021-08-26 [1] Bioconductor  
 GenomeInfoDbData   1.2.6    2021-06-02 [1] Bioconductor  
 glue               1.4.2    2020-08-27 [1] CRAN (R 4.1.0)
 GO.db            * 3.13.0   2021-06-02 [1] Bioconductor  
 httr               1.4.2    2020-07-20 [1] CRAN (R 4.1.0)
 IRanges          * 2.26.0   2021-05-19 [1] Bioconductor  
 KEGGREST           1.32.0   2021-05-19 [1] Bioconductor  
 lifecycle          1.0.0    2021-02-15 [1] CRAN (R 4.1.0)
 magrittr           2.0.1    2020-11-17 [1] CRAN (R 4.1.0)
 memoise            2.0.0    2021-01-26 [1] CRAN (R 4.1.0)
 org.At.tair.db   * 3.13.0   2021-08-31 [1] Bioconductor  
 pillar             1.6.2    2021-07-29 [1] CRAN (R 4.1.0)
 pkgbuild           1.2.0    2020-12-15 [1] CRAN (R 4.1.0)
 pkgconfig          2.0.3    2019-09-22 [1] CRAN (R 4.1.0)
 pkgload            1.2.1    2021-04-06 [1] CRAN (R 4.1.0)
 png                0.1-7    2013-12-03 [1] CRAN (R 4.1.0)
 prettyunits        1.1.1    2020-01-24 [1] CRAN (R 4.1.0)
 processx           3.5.2    2021-04-30 [1] CRAN (R 4.1.0)
 ps                 1.6.0    2021-02-28 [1] CRAN (R 4.1.0)
 purrr              0.3.4    2020-04-17 [1] CRAN (R 4.1.0)
 R6                 2.5.1    2021-08-19 [1] CRAN (R 4.1.1)
 Rcpp               1.0.7    2021-07-07 [1] CRAN (R 4.1.1)
 RCurl              1.98-1.4 2021-08-17 [1] CRAN (R 4.1.1)
 remotes            2.4.0    2021-06-02 [1] CRAN (R 4.1.0)
 rlang              0.4.11   2021-04-30 [1] CRAN (R 4.1.0)
 rprojroot          2.0.2    2020-11-15 [1] CRAN (R 4.1.0)
 RSQLite            2.2.8    2021-08-21 [1] CRAN (R 4.1.1)
 rstudioapi         0.13     2020-11-12 [1] CRAN (R 4.1.0)
 S4Vectors        * 0.30.0   2021-05-19 [1] Bioconductor  
 sessioninfo        1.1.1    2018-11-05 [1] CRAN (R 4.1.0)
 testthat           3.0.2    2021-02-14 [1] CRAN (R 4.1.0)
 tibble             3.1.2    2021-05-16 [1] CRAN (R 4.1.0)
 tidyselect         1.1.1    2021-04-30 [1] CRAN (R 4.1.0)
 usethis          * 2.0.1    2021-02-10 [1] CRAN (R 4.1.0)
 utf8               1.2.1    2021-03-12 [1] CRAN (R 4.1.0)
 vctrs              0.3.8    2021-04-29 [1] CRAN (R 4.1.0)
 withr              2.4.2    2021-04-18 [1] CRAN (R 4.1.0)
 XVector            0.32.0   2021-05-19 [1] Bioconductor  
 zlibbioc           1.38.0   2021-05-19 [1] Bioconductor  

[1] E:/R/R-4.1.0/library

Best wishes

Guandong Shang

ADD REPLY
1
Entering edit mode

You are right - I mis-spoke (mis-wrote?) about GOALL. I get confused because in my mind the directionality of the GO DAG and GOALL are switched, which seems weird. But when you do a hypergeometric test, you need to know all the GO terms that a given TAIR ID maps to, and that's the direct GO term and all of its ancestors.

Anyway, I think the problem with your code (I don't speak tidyverse, so I didn't really look that closely) is that you filter the ancestors on BP. I believe that your main point is that GOALL is missing some direct GO mappings, which is not true, at least for this TAIR ID:

> z <- select(org.At.tair.db, "AT2G17950", "GO")
'select()' returned 1:many mapping between keys and columns
> zall <- select(org.At.tair.db, "AT2G17950", "GOALL")
'select()' returned 1:many mapping between keys and columns
> all(z$GO %in% zall$GOALL)
[1] TRUE

## and the four you list
> miss <- c("GO:0000976", "GO:0003700", "GO:0005515", "GO:0005634")
> miss %in% z$GO
[1] TRUE TRUE TRUE TRUE
> miss %in% zall$GOALL
[1] TRUE TRUE TRUE TRUE

Which indicates that all the direct GO terms are in GOALL as well. But if you filter on BP, you run into problems because the direct terms aren't all BP (like the four you say are missing).

> zall[zall$GOALL %in% miss,]
        TAIR      GOALL EVIDENCEALL ONTOLOGYALL
2  AT2G17950 GO:0000976         IPI          MF
12 AT2G17950 GO:0003700         ISS          MF
13 AT2G17950 GO:0003700         TAS          MF
15 AT2G17950 GO:0005515         IPI          MF
18 AT2G17950 GO:0005634         ISM          CC

Whereas the direct GO terms are all the ontologies

> z
        TAIR         GO EVIDENCE ONTOLOGY
1  AT2G17950 GO:0000976      IPI       MF
2  AT2G17950 GO:0003700      ISS       MF
3  AT2G17950 GO:0003700      TAS       MF
4  AT2G17950 GO:0005515      IPI       MF
5  AT2G17950 GO:0005634      ISM       CC
6  AT2G17950 GO:0006355      TAS       BP
7  AT2G17950 GO:0010492      IGI       BP
8  AT2G17950 GO:0019827      IMP       BP
9  AT2G17950 GO:0045471      IEP       BP
10 AT2G17950 GO:0048653      IMP       BP
11 AT2G17950 GO:0080166      IMP       BP
12 AT2G17950 GO:0090506      IMP       BP
ADD REPLY
0
Entering edit mode

Thanks! I get it :).

ADD REPLY
0
Entering edit mode

thanks for the awesome information.

ADD REPLY

Login before adding your answer.

Traffic: 266 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6