I observed a discrepancy between the GO-term association from org.Hs.eg.db to the information given in the AMIGO2 portal; I would highly appreciate to know why and how to reconcile it. I found it in the term "viral transcription" (which happened to be enriched in my analysis and I wanted to examine it further), "GO:0019083".
In particular, my gene list had > 100 gene hits in this term, while the AMIGO2 portal at geneontology.org shows only 64 direct and indirect gene/gene product annotations (http://amigo.geneontology.org/amigo/search/annotation?q=*:*&fq=regulates_closure:%22GO:0019083%22&sfq=document_category:%22annotation%22 - filter for Organism = human). I went to investigate further into org.Hs.eg.db using the select() function.
library(org.Hs.eg.db) GO_virtcn <- select(org.Hs.eg.db, keys="GO:0019083", columns=c("GO","ONTOLOGY","SYMBOL","EVIDENCE"),keytype="GO") #retrieve all associations to GO term 0019083 length(unique(GO_virtcn$SYMBOL)) #more than 100 direct annotations are reported in this database
According to the manual, the org.Hs.eg.db object only contains direct annotations. What striked me most was that a bunch of those genes are simply ribosomal proteins (e.g. RPL10 - RPL18), which I would not per se consider being associated to viral forms of gene expression; and that genes associated to a child GO term, positive regulation by host of viral transcription, GO:0043923, did NOT occur in the list of genes:
GO_posregvirtcn <- select(org.Hs.eg.db, keys="GO:0043923", columns=c("GO","ONTOLOGY","SYMBOL","EVIDENCE"),keytype="GO") sum(GO_posregvirtcn$SYMBOL %in% GO_virtcn$SYMBOL) #not a single overlap in genes
To confirm the information from the GO homepage on lower number of genes for viral transcription, I downloaded the annotation file goa_human.gaf.gz from http://www.geneontology.org/page/download-go-annotations, read it into R and queried the direct annotations for "GO:0019083":
go_ont_ebi <- read.table("P:\\goa_human.gaf",sep="\t",comment.char="",quote="",skip=23,stringsAsFactors = FALSE) go_ont_ebi[go_ont_ebi$V5=="GO:0019083",] #this is empty --> no direct annotation for viral transcription go_ont_ebi[go_ont_ebi$V5=="GO:0043923",] #there we find 18 annotations
There was not a single direct annotation for viral transcription (as indicated on the GO homepage; all 64 gene products were indirectly associated), but 18 genes were associated to the daughter term (which was reported to have direct association to 19 genes on the GO webpage; this discrepancy would be acceptable for me).
Why do I get this (apparent?) difference in gene association to the GO term viral transcription between the GO-delivered database and org.Hs.eg.db? And could this happen also for other GO terms?
Thank you very much in advance for support in solving this!
Relevant output from sessionInfo():
R version 3.5.1 (2018-07-02) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 Matrix products: default attached base packages:  parallel stats4 stats graphics grDevices utils datasets methods base other attached packages:  DBI_1.0.0 RSQLite_2.1.1 mgsa_1.28.0 gplots_3.0.1 org.Hs.eg.db_3.6.0 GO.db_3.6.0  AnnotationDbi_1.42.1 IRanges_2.14.10 S4Vectors_0.18.3 Biobase_2.40.0 BiocGenerics_0.26.0 clusterProfiler_3.8.1
PS: I also observed >100 gene hits in viral transcription in GO overrepresentation analysis with the database packages (GO.db 3.4.0, org.Hs.eg.db 3.4.0) releases from 2016, so probably not an issue of very recent builts.