Search
Question: Genes listed in GO categories
0
8 days ago by
frezende0
frezende0 wrote:

Hi everyone,

I’m testing for GO categories enrichment using goseq (version 1.32.0) under two different approaches (as described below) and I’m getting different numbers of genes per GO category.

### Approach 1 ###

> library(goseq)
> gene.vector = as.integer(assayed.genes%in%de.genes)  #19,792 total genes (502 DE genes)
> pwf = nullp(gene.vector, "bosTau4", "ensGene")
> GO.hiper = goseq(pwf, "bosTau4", "ensGene")
> GO.hiper[GO.hiper$category=="GO:0005509",] category over_represented_pvalue under_represented_pvalue numDEInCat numInCat term ontology 1788 GO:0005509 0.01579858 0.9942288 10 182 calcium ion binding MF Even if I force the inclusion of genes without categories in the test, the number of genes per category is the same! > GO.hiper = goseq(pwf, "bosTau4", "ensGene", use_genes_without_cat = TRUE) > GO.hiper[GO.hiper$category=="GO:0005509",]
category over_represented_pvalue under_represented_pvalue numDEInCat numInCat  term ontology
1788 GO:0005509       0.02582693                0.9895319         10      182 calcium ion binding       MF

​### Approach 2 ###

> library(biomaRt)
> genome = useMart(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "btaurus_gene_ensembl", host = "www.ensembl.org")
> Ca.gene.bio <- getBM(attributes=c("ensembl_gene_id", "start_position", "end_position", "chromosome_name", "external_gene_name"), filters = 'go', values = 'GO:0005509', mart = genome)  #589 genes
> geneTOcat = data.frame(cbind(Ca.gene.bio$ensembl_gene_id,"GO:0005509")) > GO.hiper = goseq(pwf, "bosTau4", "ensGene", gene2cat = geneTOcat, use_genes_without_cat = TRUE) > GO.hiper[GO.hiper$category=="GO:0005509",]
category over_represented_pvalue under_represented_pvalue numDEInCat numInCat  term ontology
1 GO:0005509              0.02445965                0.9861623         23      520 calcium ion binding       MF

​Does the goseq use GO.db package (GO.db_3.6.0) to retrieve the genes listed in GO categories?

Is the GO annotation version used by goseq the latest one available on BioMart (biomaRt_2.36.1)?

It seems that goseq is not retrieving all the genes listed in a GO category. How can I fix it in the first approach?

Thank you for taking the time to read my question.

Have a nice weekend!

Fernanda Rezende

University of Florida, USA

modified 1 day ago by Gordon Smyth33k • written 8 days ago by frezende0
1
6 days ago by
United States
James W. MacDonald46k wrote:

You are, in situation #1, asking for all the Entrez Gene IDs that are associated with a given GO term. And in situation #2 you are asking for all the Ensembl Gene IDs that are associated with a given GO term. These are not the same thing! If you ask two different groups the same question you shouldn't be too surprised if they give different answers.

Put a different way, you seem to think that there is a known set of genes and everybody thinks the set is the same. This is not true. There are any number of differences between the annotations you can get from NCBI as compared to what you can get from EBI/EMBL.

I was curious though about how 1) the number of Ensembl IDs is almost exactly a factor of 3 greater than the number of Entrez IDs, and 2) the over-representation p-values come out almost exactly the same. I suppose the factor of 3 is just a coincidence but I'm surprised the gene set size is so different, and given that difference, I'm surprised the p-values were so consistent between the two approaches.

I totally agree with your point about different sources providing quite distinct annotations, but are those two scenarios really comparing different gene IDs?  I don't know much about the goseq functions, but both examples provide the argument 'ensGene' which implies using the Ensembl IDs in both instances to me.

I more wonder if this may be more related to using an out-of-date genome/annotation.  The bosTau4 genome was released in 2007, and I think we're up to bosTau8 now.  I wonder if goseq/GO.db struggles when it tries to identify the GO terms, where as when using biomaRt you basically bypass that.

3

The argument isn't 'ensGene', that's the input for the 'genes' argument, which tells goseq what sort of ID you are using. If the OP were using Entrez Gene IDs, then the argument would be 'knownGene'.

The genome has very little to do with this, other than for getting the gene lengths. In my mind, annotations and genome builds are almost completely orthogonal, with genome builds having releases and being tied to a certain time point (and also having only to do with where things are in the genome), whereas annotations have almost nothing to do with genome builds, but instead have to do with what genes we think might be in the genome, and what those genes do, etc. The gene annotations are updated almost constantly, with things like RefSeq having weekly updates, whereas the genome builds are released and then remain almost static.

That's a long way of saying that the data in GO.db and org.Bt.eg.db are current as of early 2018, with the notable exception of the Ensembl data which appears to come from 2017, for whatever reason. And that's the data we are relying upon to match GO and Ensembl IDs.

The weird thing (to me) with goseq is that you can (and have to, in this case) use Ensembl Gene IDs, but if you don't pass in a gene2cat data.frame, then under the hood it will call getgo, which just dumps out the org.Bt.egGOALLEGS table from org.Bt.eg.db, which is a mapping from EG to GO, and then uses the org.Bt.egENSEMBL table to map EG to Ensembl IDs. Which is where the problem lies.

> z <- getBM("ensembl_gene_id", "go","GO:0005509", mart)
> gene.vector <- sample(0:1, length(z[,1]), TRUE)
> names(gene.vector) <- z[,1]
> pwf <- nullp(gene.vector, "bosTau4","ensGene")
DEgenes bias.data       pwf
ENSBTAG00000017746       0    4551.0 0.5106931
ENSBTAG00000002865       1     665.5 0.4503569
ENSBTAG00000007244       1    5812.0 0.5302726
ENSBTAG00000009696       1    2656.0 0.4812669
ENSBTAG00000011862       1    2858.0 0.4844038
ENSBTAG00000020657       1   13793.0 0.6541682
> dim(pwf)
[1] 589   3
> gene2cat = getgo(rownames(pwf), "bosTau4","ensGene")

> sum(sapply(gene2cat, is.null))
[1] 402
> length(gene2cat)
[1] 589

So there are two issues here. First, there are only 211 Entrez Gene IDs that map to GO:0005509, according to (evidently) NCBI and Geneontology, as of April this year. There are more that map from Ensembl IDs, but the way getgo works, using old code like toTable rather than select or mapIds, we lose out on bunches of EG -> Ensembl mappings:

> head(names(gene2cat))
[1] NA                   NA                   NA
[4] "ENSBTAG00000009696" NA                   NA

> huh2 <- mapIds(org.Bt.eg.db, row.names(pwf), "ENTREZID", "ENSEMBL")
'select()' returned 1:many mapping between keys and columns
> length(huh2)
[1] 589
ENSBTAG00000017746 ENSBTAG00000002865 ENSBTAG00000007244 ENSBTAG00000009696
"615883"           "541458"                 NA           "536607"
ENSBTAG00000011862 ENSBTAG00000020657
"616023"           "508251"

> sum(is.na(huh2))
[1] 64

1

James, I am not a goseq author and I haven't examined the code, but it seems to me that goseq is most likely doing the right thing.

You compute a vector huh2 of over 500 gene ids, but only 184 of those ids can be mapped to GO terms using the organism package. The entries that are NA in gene2cat but not NA in huh2 are exactly those ids that can't be mapped.

> MyGO <- org.Bt.egGO2ALLEGS
> Rkeys(MyGO) <- "GO:0005509"
> MyEG <- mappedLkeys(MyGO)
> length(MyEG)
[1] 205
> sum(huh2 %in% MyEG, na.rm=TRUE)
[1] 184


I think that's correct. Thanks for pointing that out. The only real issue here (IMO) then is that Ensembl IDs will be silently converted to Gene IDs to do the GO mapping if a gene2cat data.frame isn't supplied. IMO this is a fraught process! The OP has already documented that EBI/EMBL have quite different ideas about what GO terms should be attributed to a given Ensembl ID as compared to NCBI.

While there are some warnings about sticking with the same annotation service (particularly in the details section for getgo), both the example for getgo and the vignette seem to indicate that this is a non-issue, as Ensembl IDs are used for both the example for getgo, as well as the examples in the vignette. This seems to indicate that going across annotation services is a reasonable thing to do, whereas I would argue just the opposite.

I agree that sticking to native gene ids is desirable if one can do it, and I agree that the goseq documentation gives the impression of being Ensembl orientated whereas it actually uses the Bioc organism packages by default.

But do you really mean to say that mapping gene Ids is never a sensible thing to do? The Bioconductor organism packages are Entrez Gene based, so Bioconductor users and developers have no choice but to convert back to Entrez Gene Ids if they wish to use Bioconductor annotation infrastructure.

I think that GO annotation pretty much always involves mapping across annotation systems anyway. For mouse, the master GO annotation source is MGI, for drosophila it is flybase, so the NCBI and Ensembl annotation for those species are both translated from the original. A little bit of searching suggests that the bos Taurus GO associations may be UniProt based. For bos Taurus, I would guess that most of the GO annotation is based on mapping orthologs from other species anyway.

No, I am not saying that mapping across services is never a sensible thing to do. What I am saying is that it is a non-trivial thing to do, because it may have unexpected consequences. For example, in the original post to this thread the OP states that the Ensembl mappings for that GO term were for 520 genes. But after processing through the Gene IDs, we are down to 182.

This didn't have an effect on the results for this particular GO term, because there was a proportional loss of the DE genes as well as the nonDE genes for that category. Is that consistent? I sort of doubt it. To check we could simulate some data:

library(RMySQL)
library(goseq)
library(biomaRt)
mart <- useMart("ensembl","btaurus_gene_ensembl")
dbconn <- dbConnect(MySQL(), user="genome", host = "genome-mysql.soe.ucsc.edu", dbname = "bosTau4")
## get all Ensembl IDs for bosTau4
btens <- dbGetQuery(dbconn, "select distinct name2 from ensGene;")
btens2 <- rep(0, nrow(btens))
names(btens2) <- btens[,1]
set.seed(0xabeef)
## simulate 500 significant genes
sig.genes <- sample(1:nrow(btens), 500)
btens2[sig.genes] <- 1
## Ensembl mapping to GO
gene2cat <- getBM(c("ensembl_gene_id","go_id"), "ensembl_gene_id", as.character(btens[,1]), mart)
pwf <- nullp(btens2, "bosTau4", "ensGene")
z <- goseq(pwf, "bosTau4", "ensGene")
zz <- goseq(pwf, "bosTau4", "ensGene", gene2cat = gene2cat)
z2 <- goseq(pwf, "bosTau4", "ensGene", use_genes_without_cat = TRUE)
zz2 <- goseq(pwf, "bosTau4", "ensGene", gene2cat = gene2cat, use_genes_without_cat = TRUE)

A Venn diagram of the significantly over-represented GO terms (using unadjusted p < 0.05), where use_genes_without_cat = TRUE:

and if we only use those genes with a GO term:

That's IMO a big difference in results, based on a purely technical issue of how the genes were mapped to GO IDs.

Thanks for the comprehensive answer.  I didn't appreciate that under the hood the core relationship is always GOTerm-to-EntrezID and that the will always be another mapping required for Gene IDs from any other annotation.

1
1 day ago by
Gordon Smyth33k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth33k wrote:

Dear Fernanda,

Actually goseq is using up-to-date annotation.

Here is what's happening. If you supply your own gene2cat annotation, then goseq will use that. If you don't, then goseq uses the GO-gene associations provided in the relevant Bioconductor organism package, in this case org.Bt.eg.db. The Bioconductor organism packages are based on NCBI annotation and Entrez Gene Ids.

For reasons that we don't completely understand, Ensembl associates over 500 Ensembl genes with your GO term but NCBI associates only 200 Entrez Genes with the same GO term. So the difference between the two approach is not a fault of goseq, but just the difference between Ensembl and NCBI.

As a general rule, NCBI is more conservative than Ensembl. Ensembl tends to include more of everything: genes, transcripts, associations and so on. NCBI tends to include only well annotated features and associations whereas Ensembl tends to include more features for which there is less (but still some) evidence.

In this case, you will notice that the enrichment of your GO term is weakened rather than strengthened when you add in the extra Ensembl gene associations, because 10/182 is stronger enrichment than 23/520. In other words, the genes given by Ensembl but not NCBI were less likely to be in your DE list than genes given by both. Now I know nothing about bos Taurus or about this particular GO term, so this particular example could be coincidence. But, as a general principle, one needs to balance the greater comprehensiveness of Ensembl with the greater reliability of NCBI, and we often do see greater enrichment rates for pathway analyses using better annotated genes.

In this case, the difference between NCBI and Ensembl is surprisingly large. You have decide yourself when to use Ensembl (European) or NCBI (American) annotation. goseq allows you to do either. Since you decided to use Ensembl genes in the first place, it would be consistent to use Ensembl gene-GO associations. However the NCBI-based analysis provided by goseq in Approach1 is probably also good, for the reasons I have explained.