Hi everyone,
I’m testing for GO categories enrichment using goseq (version 1.32.0) under two different approaches (as described below) and I’m getting different numbers of genes per GO category.
### Approach 1 ###
> library(goseq) > gene.vector = as.integer(assayed.genes%in%de.genes) #19,792 total genes (502 DE genes) > pwf = nullp(gene.vector, "bosTau4", "ensGene") > GO.hiper = goseq(pwf, "bosTau4", "ensGene") > GO.hiper[GO.hiper$category=="GO:0005509",] category over_represented_pvalue under_represented_pvalue numDEInCat numInCat term ontology 1788 GO:0005509 0.01579858 0.9942288 10 182 calcium ion binding MF
Even if I force the inclusion of genes without categories in the test, the number of genes per category is the same!
> GO.hiper = goseq(pwf, "bosTau4", "ensGene", use_genes_without_cat = TRUE) > GO.hiper[GO.hiper$category=="GO:0005509",] category over_represented_pvalue under_represented_pvalue numDEInCat numInCat term ontology 1788 GO:0005509 0.02582693 0.9895319 10 182 calcium ion binding MF
### Approach 2 ###
> library(biomaRt) > genome = useMart(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "btaurus_gene_ensembl", host = "www.ensembl.org") > Ca.gene.bio <- getBM(attributes=c("ensembl_gene_id", "start_position", "end_position", "chromosome_name", "external_gene_name"), filters = 'go', values = 'GO:0005509', mart = genome) #589 genes > geneTOcat = data.frame(cbind(Ca.gene.bio$ensembl_gene_id,"GO:0005509")) > GO.hiper = goseq(pwf, "bosTau4", "ensGene", gene2cat = geneTOcat, use_genes_without_cat = TRUE) > GO.hiper[GO.hiper$category=="GO:0005509",] category over_represented_pvalue under_represented_pvalue numDEInCat numInCat term ontology 1 GO:0005509 0.02445965 0.9861623 23 520 calcium ion binding MF
Does the goseq use GO.db package (GO.db_3.6.0) to retrieve the genes listed in GO categories?
Is the GO annotation version used by goseq the latest one available on BioMart (biomaRt_2.36.1)?
It seems that goseq is not retrieving all the genes listed in a GO category. How can I fix it in the first approach?
Thank you for taking the time to read my question.
Have a nice weekend!
Fernanda Rezende
University of Florida, USA
I was curious though about how 1) the number of Ensembl IDs is almost exactly a factor of 3 greater than the number of Entrez IDs, and 2) the over-representation p-values come out almost exactly the same. I suppose the factor of 3 is just a coincidence but I'm surprised the gene set size is so different, and given that difference, I'm surprised the p-values were so consistent between the two approaches.
I totally agree with your point about different sources providing quite distinct annotations, but are those two scenarios really comparing different gene IDs? I don't know much about the goseq functions, but both examples provide the argument 'ensGene' which implies using the Ensembl IDs in both instances to me.
I more wonder if this may be more related to using an out-of-date genome/annotation. The bosTau4 genome was released in 2007, and I think we're up to bosTau8 now. I wonder if goseq/GO.db struggles when it tries to identify the GO terms, where as when using biomaRt you basically bypass that.
The argument isn't 'ensGene', that's the input for the 'genes' argument, which tells goseq what sort of ID you are using. If the OP were using Entrez Gene IDs, then the argument would be 'knownGene'.
The genome has very little to do with this, other than for getting the gene lengths. In my mind, annotations and genome builds are almost completely orthogonal, with genome builds having releases and being tied to a certain time point (and also having only to do with where things are in the genome), whereas annotations have almost nothing to do with genome builds, but instead have to do with what genes we think might be in the genome, and what those genes do, etc. The gene annotations are updated almost constantly, with things like RefSeq having weekly updates, whereas the genome builds are released and then remain almost static.
That's a long way of saying that the data in GO.db and org.Bt.eg.db are current as of early 2018, with the notable exception of the Ensembl data which appears to come from 2017, for whatever reason. And that's the data we are relying upon to match GO and Ensembl IDs.
The weird thing (to me) with
goseq
is that you can (and have to, in this case) use Ensembl Gene IDs, but if you don't pass in agene2cat
data.frame, then under the hood it will callgetgo
, which just dumps out theorg.Bt.egGOALLEGS
table from org.Bt.eg.db, which is a mapping from EG to GO, and then uses theorg.Bt.egENSEMBL
table to map EG to Ensembl IDs. Which is where the problem lies.So there are two issues here. First, there are only 211 Entrez Gene IDs that map to GO:0005509, according to (evidently) NCBI and Geneontology, as of April this year. There are more that map from Ensembl IDs, but the way
getgo
works, using old code liketoTable
rather thanselect
ormapIds
, we lose out on bunches of EG -> Ensembl mappings:James, I am not a goseq author and I haven't examined the code, but it seems to me that goseq is most likely doing the right thing.
You compute a vector
huh2
of over 500 gene ids, but only 184 of those ids can be mapped to GO terms using the organism package. The entries that are NA ingene2cat
but not NA inhuh2
are exactly those ids that can't be mapped.I think that's correct. Thanks for pointing that out. The only real issue here (IMO) then is that Ensembl IDs will be silently converted to Gene IDs to do the GO mapping if a
gene2cat
data.frame
isn't supplied. IMO this is a fraught process! The OP has already documented that EBI/EMBL have quite different ideas about what GO terms should be attributed to a given Ensembl ID as compared to NCBI.While there are some warnings about sticking with the same annotation service (particularly in the details section for
getgo
), both the example forgetgo
and the vignette seem to indicate that this is a non-issue, as Ensembl IDs are used for both the example forgetgo
, as well as the examples in the vignette. This seems to indicate that going across annotation services is a reasonable thing to do, whereas I would argue just the opposite.I agree that sticking to native gene ids is desirable if one can do it, and I agree that the goseq documentation gives the impression of being Ensembl orientated whereas it actually uses the Bioc organism packages by default.
But do you really mean to say that mapping gene Ids is never a sensible thing to do? The Bioconductor organism packages are Entrez Gene based, so Bioconductor users and developers have no choice but to convert back to Entrez Gene Ids if they wish to use Bioconductor annotation infrastructure.
I think that GO annotation pretty much always involves mapping across annotation systems anyway. For mouse, the master GO annotation source is MGI, for drosophila it is flybase, so the NCBI and Ensembl annotation for those species are both translated from the original. A little bit of searching suggests that the bos Taurus GO associations may be UniProt based. For bos Taurus, I would guess that most of the GO annotation is based on mapping orthologs from other species anyway.
No, I am not saying that mapping across services is never a sensible thing to do. What I am saying is that it is a non-trivial thing to do, because it may have unexpected consequences. For example, in the original post to this thread the OP states that the Ensembl mappings for that GO term were for 520 genes. But after processing through the Gene IDs, we are down to 182.
This didn't have an effect on the results for this particular GO term, because there was a proportional loss of the DE genes as well as the nonDE genes for that category. Is that consistent? I sort of doubt it. To check we could simulate some data:
A Venn diagram of the significantly over-represented GO terms (using unadjusted p < 0.05), where use_genes_without_cat = TRUE:
and if we only use those genes with a GO term:
That's IMO a big difference in results, based on a purely technical issue of how the genes were mapped to GO IDs.
Thanks for the comprehensive answer. I didn't appreciate that under the hood the core relationship is always GOTerm-to-EntrezID and that the will always be another mapping required for Gene IDs from any other annotation.