GO vs GOALL in rrvgo and GOSemSim
Entering edit mode
Jenny Drnevich ★ 2.0k
Last seen 29 days ago
United States

I was playing around with the new rrvgo package to reduce GO terms and quickly came across a puzzle: it was somehow missing some of the GO terms. A reproducible example:

> library(org.Hs.eg.db)
> library(limma)
> library(rrvgo)

> #Get random set of genes
> set.seed(8)
> r.genes <- sample(keys(org.Hs.eg.db, keytype = "ENTREZID"), 750)

> # Do limma's goana test for over-representation
> gotest <- goana(r.genes, universe = keys(org.Hs.eg.db, keytype = "ENTREZID"))
> gotest <- gotest[gotest$P.DE < 0.05,]
> table(gotest$Ont)
 BP  CC  MF 
294  28  75 

> #Run rrvgo::calculateSimMatrix on only BP terms
> BPterms <- rownames(gotest)[gotest$Ont %in% "BP"]
> length(BPterms)
[1] 294
> simMatrix <- calculateSimMatrix(BPterms,
+                                 orgdb="org.Hs.eg.db",
+                                 ont="BP",
+                                 method="Rel")
preparing gene to GO mapping data...
preparing IC data...
Warning message:
In calculateSimMatrix(rownames(gotest)[gotest$Ont %in% "BP"], orgdb = "org.Hs.eg.db",  :
  Removed 30 terms that were not found in orgdb for BP

Why was it not finding 30 BP terms that were output from goana()? After some digging, I found that it was because calculateSimMatrix() is using GOSemSim::godata(), which in its code is pulling only GO terms and not GOALL terms. Example:

> #Pulled from the code of GOSemSim::godata:
> kk <- keys(org.Hs.eg.db, keytype = "ENTREZID")
> goAnno <- suppressMessages(select(org.Hs.eg.db, keys = kk, keytype = "ENTREZID", 
+                                   columns = c("GO", "ONTOLOGY")))
> sum(!BPterms %in% goAnno$GO)
[1] 58
> #Not sure why this is 59 and not 30, but moving on...

> #instead, pull GOALL and ONTOLOGYALL
> goAnno2 <- suppressMessages(select(org.Hs.eg.db, keys = kk, keytype = "ENTREZID", 
+                                   columns = c("GOALL", "ONTOLOGYALL")))
> sum(!BPterms %in% goAnno2$GO)
[1] 0

Now, these 30 terms that were excluded do not seem to be directly annotated to any human gene, and so it is probably fine that they get excluded. However, what about the case where gene A gets annotated to child term Y while gene B only gets annotated to Y's parent Z? When testing Z, only 1 gene will be seen rather than 2 as I think should be the case. rrvgo::calculateSimMatrix() does seem to be pulling ancestors, but my hacking to create a GOSemSimDATA object that was based on GOALL does seem lead to different similarity matrices. That code is a bit messy so I won't post it here (and I'm not sure it's 100% correct) but I thought I'd stop and ask this question before spending any more time: Should only directly attributed GO terms be used when reducing GO complexity or should the GOALL mappings be used?

Thanks in advance for any opinions!

rrvgo GOSemSim GO.db • 1.5k views
Entering edit mode

Hi Jenny, this is a very good point. From the Gene Ontology page describing the annotations they provide (where the org.Hs.eg.db take GO annotation from), they state that:

General principles of GO annotations: <...>

  • Gene products are annotated to the most granular term in the ontology that is supported by the available evidence.
  • By the transitivity principle, an annotation to a GO term implies annotation to all its parents.


To my understanding, AnnotationForge:::.makeNewGOTables() is applying this transitivity principle in order to create the GO2ALLEGS map from the GO map. Therefore, the use of the GOALL may be more appropriate.


Login before adding your answer.

Traffic: 638 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6