Hi all,
Using the GoStats package and the org.Mm.eg.db, org.Mm.egGO annotation data I created a list of over represented GO terms for my data set. However, when I attempt to match my genes of interest with the GO terms, the number of genes associated with each GO terms does not perfectly match the "Count" number of genes determined by GoStats. I have tried a variety of methods to over come this issue including:
Using bioMart ensembl data
The AmiGO2 Database
The BioGPS database
Searching within the org.Mm.egGO annotation
but none seem to perfectly match the expected count number. Has anyone else had a similar problem to this and have a solution/ an explanation for why these lists are not matching?
Thanks!
You will have to show your code, including how you instantiated your GOHyperGParams object, and how you tried to compute the expected counts, if you expect anybody to be able to help you.
Right you are:
ontology="BP"
Gosbp1 <- mget(as.character(SetUniv$entrezgene), org.Mm.egGO, ifnotfound=NA)
haveGobp1 <- sapply(Gosbp1, function(x) {
if (length(x) == 1 && is.na(x)) FALSE
else {
onts <- Biobase::subListExtract(x, "Ontology",
simplify=TRUE)
ontology %in% onts
}})
SetUnivgobp1<-SetUniv[haveGobp1,]
ontology="BP"
Gosbp2 <- mget(as.character(GOI$entrezgene), org.Mm.egGO, ifnotfound=NA)
haveGobp2 <- sapply(Gosbp2, function(x) {
if (length(x) == 1 && is.na(x)) FALSE
else {
onts <- Biobase::subListExtract(x, "Ontology",
simplify=TRUE)
ontology %in% onts
}})
GOIgobp2<-GOI[haveGobp2,]
paramsbp<- new("GOHyperGParams", geneIds = GOIgobp2$entrezgene , universeGeneIds=SetUnivgobp1$entrenzgene,annotation="org.Mm.eg.db", pvalueCutoff=.001, ontology = "BP", conditional= TRUE, testDirection="over")
hgbp<-hyperGTest(paramsbp)
hgbps<-summary(hgbp)
If you use conditional = TRUE, then the number of genes in a GO term is conditional on whether or not those genes were already significant in a term lower on the DAG. As an artificial example, say there are two terms, MAP Kinase, and MAP Kinase Kinase, and there are 20 genes annotated to the former, and 10 to the latter.
Since this is a DAG, by definition, 10 of the 20 genes in MAP Kinase are the 10 genes that annotate to MAP Kinase Kinase (because a MAP Kinase Kinase is also by definition a MAP Kinase as well). Now say that 8 of the 10 genes in MAP Kinase Kinase are significant, so that term is found to be significant. Since you are using the conditional hypergeometric, when you go to the next level (MAP Kinase), the genes that gave rise to significance in the lower term are removed, because you would be double counting them (I forget if all 10 are removed, or just the 8 significant ones - IIRC it's all of them, but don't quote me on that).
So if MAP Kinase also comes up as significant, the expected count will only be the 10 that were still annotated to that term, after conditioning out the lower genes from MAP Kinase Kinase.