I was playing around with the new rrvgo package to reduce GO terms and quickly came across a puzzle: it was somehow missing some of the GO terms. A reproducible example:
> library(org.Hs.eg.db) > library(limma) > library(rrvgo) > #Get random set of genes > set.seed(8) > r.genes <- sample(keys(org.Hs.eg.db, keytype = "ENTREZID"), 750) > # Do limma's goana test for over-representation > gotest <- goana(r.genes, universe = keys(org.Hs.eg.db, keytype = "ENTREZID")) > gotest <- gotest[gotest$P.DE < 0.05,] > table(gotest$Ont) BP CC MF 294 28 75 > #Run rrvgo::calculateSimMatrix on only BP terms > BPterms <- rownames(gotest)[gotest$Ont %in% "BP"] > length(BPterms)  294 > simMatrix <- calculateSimMatrix(BPterms, + orgdb="org.Hs.eg.db", + ont="BP", + method="Rel") preparing gene to GO mapping data... preparing IC data... Warning message: In calculateSimMatrix(rownames(gotest)[gotest$Ont %in% "BP"], orgdb = "org.Hs.eg.db", : Removed 30 terms that were not found in orgdb for BP
Why was it not finding 30 BP terms that were output from
goana()? After some digging, I found that it was because
calculateSimMatrix() is using
GOSemSim::godata(), which in its code is pulling only GO terms and not GOALL terms. Example:
> #Pulled from the code of GOSemSim::godata: > > kk <- keys(org.Hs.eg.db, keytype = "ENTREZID") > goAnno <- suppressMessages(select(org.Hs.eg.db, keys = kk, keytype = "ENTREZID", + columns = c("GO", "ONTOLOGY"))) > > sum(!BPterms %in% goAnno$GO)  58 > #Not sure why this is 59 and not 30, but moving on... > #instead, pull GOALL and ONTOLOGYALL > goAnno2 <- suppressMessages(select(org.Hs.eg.db, keys = kk, keytype = "ENTREZID", + columns = c("GOALL", "ONTOLOGYALL"))) > sum(!BPterms %in% goAnno2$GO)  0
Now, these 30 terms that were excluded do not seem to be directly annotated to any human gene, and so it is probably fine that they get excluded. However, what about the case where gene A gets annotated to child term Y while gene B only gets annotated to Y's parent Z? When testing Z, only 1 gene will be seen rather than 2 as I think should be the case.
rrvgo::calculateSimMatrix() does seem to be pulling ancestors, but my hacking to create a GOSemSimDATA object that was based on GOALL does seem lead to different similarity matrices. That code is a bit messy so I won't post it here (and I'm not sure it's 100% correct) but I thought I'd stop and ask this question before spending any more time: Should only directly attributed GO terms be used when reducing GO complexity or should the GOALL mappings be used?
Thanks in advance for any opinions!