GoStats ontology count number does not match genes associated with GO term
0
0
Entering edit mode
@adamgetzler-11491
Last seen 7.6 years ago

Hi all,

Using the GoStats package and the org.Mm.eg.db,  org.Mm.egGO annotation data I created a list of over represented GO terms for my data set. However, when I attempt to match my genes of interest with the GO terms, the number of genes associated with each GO terms does not perfectly match the "Count" number of genes determined by GoStats. I have tried a variety of methods to over come this issue including:

Using bioMart ensembl data

The AmiGO2 Database

The BioGPS database 

Searching within the org.Mm.egGO annotation

but none seem to perfectly match the expected count number. Has anyone else had a similar problem to this and have a solution/ an explanation for why these lists are not matching?

 

Thanks!

gostat go terms gene ontology • 1.0k views
ADD COMMENT
0
Entering edit mode

You will have to show your code, including how you instantiated your GOHyperGParams object, and how you tried to compute the expected counts, if you expect anybody to be able to help you.

ADD REPLY
0
Entering edit mode

Right you are:


ontology="BP"
Gosbp1 <- mget(as.character(SetUniv$entrezgene), org.Mm.egGO, ifnotfound=NA)
haveGobp1 <- sapply(Gosbp1, function(x) {
  if (length(x) == 1 && is.na(x)) FALSE
  else {
    onts <- Biobase::subListExtract(x, "Ontology",
                                    simplify=TRUE)
    ontology %in% onts  
  }})

SetUnivgobp1<-SetUniv[haveGobp1,]


ontology="BP"

Gosbp2 <- mget(as.character(GOI$entrezgene), org.Mm.egGO, ifnotfound=NA)
haveGobp2 <- sapply(Gosbp2, function(x) {
  if (length(x) == 1 && is.na(x)) FALSE
  else {
    onts <- Biobase::subListExtract(x, "Ontology",
                                    simplify=TRUE)
    ontology %in% onts  
  }})

GOIgobp2<-GOI[haveGobp2,]


paramsbp<- new("GOHyperGParams", geneIds = GOIgobp2$entrezgene , universeGeneIds=SetUnivgobp1$entrenzgene,annotation="org.Mm.eg.db", pvalueCutoff=.001, ontology = "BP", conditional= TRUE, testDirection="over")

hgbp<-hyperGTest(paramsbp)
hgbps<-summary(hgbp)

 

ADD REPLY
1
Entering edit mode

If you use conditional = TRUE, then the number of genes in a GO term is conditional on whether or not those genes were already significant in a term lower on the DAG. As an artificial example, say there are two terms, MAP Kinase, and MAP Kinase Kinase, and there are 20 genes annotated to the former, and 10 to the latter.

Since this is a DAG, by definition, 10 of the 20 genes in MAP Kinase are the 10 genes that annotate to MAP Kinase Kinase (because a MAP Kinase Kinase is also by definition a MAP Kinase as well). Now say that 8 of the 10 genes in MAP Kinase Kinase are significant, so that term is found to be significant. Since you are using the conditional hypergeometric, when you go to the next level (MAP Kinase), the genes that gave rise to significance in the lower term are removed, because you would be double counting them (I forget if all 10 are removed, or just the 8 significant ones - IIRC it's all of them, but don't quote me on that).

So if MAP Kinase also comes up as significant, the expected count will only be the 10 that were still annotated to that term, after conditioning out the lower genes from MAP Kinase Kinase.

ADD REPLY

Login before adding your answer.

Traffic: 1101 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6