Gene-GO-term relationship discrepancy between org.Hs.eg.db and geneontology.org
1
0
Entering edit mode
@katharinabaum-16998
Last seen 2.7 years ago

I observed a discrepancy between the GO-term association from org.Hs.eg.db to the information given in the AMIGO2 portal; I would highly appreciate to know why and how to reconcile it. I found it in the term "viral transcription" (which happened to be enriched in my analysis and I wanted to examine it further), "GO:0019083".

In particular, my gene list had > 100 gene hits in this term, while the AMIGO2 portal at geneontology.org shows only 64 direct and indirect gene/gene product annotations (http://amigo.geneontology.org/amigo/search/annotation?q=*:*&fq=regulates_closure:%22GO:0019083%22&sfq=document_category:%22annotation%22 - filter for Organism = human). I went to investigate further into org.Hs.eg.db using the select() function.

library(org.Hs.eg.db)

GO_virtcn <- select(org.Hs.eg.db, keys="GO:0019083", columns=c("GO","ONTOLOGY","SYMBOL","EVIDENCE"),keytype="GO") #retrieve all associations to GO term 0019083
length(unique(GO_virtcn$SYMBOL)) #more than 100 direct annotations are reported in this database

According to the manual, the org.Hs.eg.db object only contains direct annotations. What striked me most was that a bunch of those genes are simply ribosomal proteins (e.g. RPL10 - RPL18), which I would not per se consider being associated to viral forms of gene expression; and that genes associated to a child GO term, positive regulation by host of viral transcription, GO:0043923, did NOT occur in the list of genes:

GO_posregvirtcn <- select(org.Hs.eg.db, keys="GO:0043923", columns=c("GO","ONTOLOGY","SYMBOL","EVIDENCE"),keytype="GO")
sum(GO_posregvirtcn$SYMBOL %in% GO_virtcn$SYMBOL) #not a single overlap in genes

To confirm the information from the GO homepage on lower number of genes for viral transcription, I downloaded the annotation file goa_human.gaf.gz from http://www.geneontology.org/page/download-go-annotations, read it into R and queried the direct annotations for "GO:0019083":

go_ont_ebi <- read.table("P:\\goa_human.gaf",sep="\t",comment.char="",quote="",skip=23,stringsAsFactors = FALSE)
go_ont_ebi[go_ont_ebi$V5=="GO:0019083",] #this is empty --> no direct annotation for viral transcription
go_ont_ebi[go_ont_ebi$V5=="GO:0043923",] #there we find 18 annotations

There was not a single direct annotation for viral transcription (as indicated on the GO homepage; all 64 gene products were indirectly associated), but 18 genes were associated to the daughter term (which was reported to have direct association to 19 genes on the GO webpage; this discrepancy would be acceptable for me).

Why do I get this (apparent?) difference in gene association to the GO term viral transcription between the GO-delivered database and org.Hs.eg.db? And could this happen also for other GO terms?

Thank you very much in advance for support in solving this!

Relevant output from sessionInfo():

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DBI_1.0.0             RSQLite_2.1.1         mgsa_1.28.0           gplots_3.0.1          org.Hs.eg.db_3.6.0    GO.db_3.6.0          
 [7] AnnotationDbi_1.42.1  IRanges_2.14.10       S4Vectors_0.18.3      Biobase_2.40.0        BiocGenerics_0.26.0   clusterProfiler_3.8.1

PS: I also observed >100 gene hits in viral transcription in GO overrepresentation analysis with the database packages (GO.db 3.4.0, org.Hs.eg.db 3.4.0) releases from 2016, so probably not an issue of very recent builts.

org.hs.eg.db • 1.0k views
ADD COMMENT
3
Entering edit mode
@james-w-macdonald-5106
Last seen 22 minutes ago
United States

The org.Hs.eg.db package provides mappings from NCBI Gene IDs to all other annotations, including GO. These mappings are based on data that we download from NCBI, and are simply repackaged into the OrgDb packages to make things easier for our end users. We don't do any vetting of the data, and assume that NCBI are competently doing their jobs.

The file that we use to map Gene IDs to GO is the gene2go file found here. The org.Hs.eg.db package you are using is based on data we downloaded in late March or early April, so is somewhat out of date (NCBI updates these files on a daily or weekly basis; in the interest of reproducibility we downlaod and build annotation packages semi-annually, so we make a trade-off between being absolutely current and having some stability in what's available). Anyway...

Here is what NCBI says about that file:

===========================================================================
gene2go                                         recalculated daily
---------------------------------------------------------------------------
           This file reports the GO terms that have been associated
            with Genes in Entrez Gene.
           It is generated by processing the gene_association files
            on the GO ftp site: 
            http://www.geneontology.org/GO.current.annotations.shtml
           and comparing the DB_Object_ID to annotation in Gene,
           as also reported in gene_info.gz

           Multiple gene_associations file may be used for any genome.
           If so, duplicate information is not reported; but unique
           contributions of GO terms, evidence codes, and citations are.
 
           The file that is used to establish the rules for 
           the files and fields that are used for each taxon 
           is documented in go_process.xml

Here's what is in that file. First the header

$ zcat gene2go.gz | head -n 3
#tax_id    GeneID    GO_ID    Evidence    Qualifier    GO_term    PubMed    Category
3702    814629    GO:0005634    ISM    -    nucleus    -    Component
3702    814629    GO:0008150    ND    -    biological_process    -    Process

Now for the GO term you are interested in

zcat gene2go.gz | awk '{if($1 == 9606 && $3 == "GO:0019083") print $0}' - | head -n 3
9606    3921    GO:0019083    TAS    -    viral transcription    -    Process
9606    4736    GO:0019083    TAS    -    viral transcription    -    Process
9606    4927    GO:0019083    TAS    -    viral transcription    -    Process

And finally the total number of such mappings

$ zcat gene2go.gz | awk '{if($1 == 9606 && $3 == "GO:0019083") print $0}' - | wc -l
108

Which agrees with what you are getting from the org.Hs.eg.db package. This may not correspond with what you see on the AmiGO site, but that really comes down to a disagreement between NCBI and the folks at geneontology, if in fact such a disagreement exists. What is meant by 'processing the gene_association files on the GO ftp site' is probably a pretty complex process, particularly given that the associations on GO are with UniProt and NCBI is mapping to Gene ID.

ADD COMMENT
0
Entering edit mode

I thank you very much for this explanation! From the viewpoint of R, it is comprehensive, org.Hs.eg.db is really not to blame.

I guess I will have to ask at NCBI directly then - getting at all non-zero direct annotations for GO terms which originally had zero direct annotations is at least irritating (as this is really hard to be explained by any conversion of gene IDs from or to whatsoever only ;). Examining further, I found discrepancies in gene count of more than 5 genes (on the SYMBOL level) for more than 500 GO terms among the ca. 12500 GO BP terms...

ADD REPLY
1
Entering edit mode

I should point out that the GO hypergeometric test doesn't use the direct mappings anyway. By definition, all genes that are annotated to any progeny term are also annotated to a given term (e.g., all the genes that are annotated to positive regulation by host of viral transcription are also annotated to viral transcription, because the former is a subset of the latter).

When you do a  GO hypergeometric test, it uses GOALL, not GO. If you use a conditional test there may be some genes that map to progeny terms that get removed because they gave rise to a significant result at a lower level of the GO DAG, but for an unconditional test you expect the count of genes that are in a given term to be the sum of all the progeny term genes, plus the direct associations.

ADD REPLY

Login before adding your answer.

Traffic: 444 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6