Gene Ontology Overrepresentation Analysis with a custom level of Abstraction of GO Terminology
Entering edit mode
graggsd • 0
Last seen 5.8 years ago

I want to perform an overrepresentation for GO terms for a particular microarray experiment. Most of the methods to do this in R (see the "gometh" function included in the missMethyl package or "hyperGTest" in the GOstats package) only consider a test for enrichment of terms directly annotated by a given gene. I would like to be able to customize the level of abstraction of my analysis such that I can ask questions about higher level GO terms, which I'm hoping will increase the power of my analysis for specific questions by reducing the amount of multiple hypothesis testing correction that I must implement. Does anyone have insight on the best way to approach this?

gene ontology GO overrepresentation analysis missmethyl gostats • 883 views
Entering edit mode
Last seen 4 hours ago
United States

The tests consider all direct and child terms (actually any terms that are 'lower' in the directed acyclic graph (DAG) than the given term), not just the direct terms.You can condition out significant lower terms for hyperGTest, as well as the functions in topGO, but there isn't any facility that I know of to do arbitrary GO term assignment.

In general, people want the lowest term in the DAG, so long as there are a reasonable number of genes for that term, as the lower you go in the DAG, the more specific the term, and hopefully the more interpretable it will be. Anyway, as an example, let's say you have a gene that is annotated to Map kinase kinase kinase activity. When you test signal transduction by protein phosphorylation, unless you condition (and that gene gets 'used up' in a lower term), the set of signal transduction by protein phosphorylation genes will include the gene that is mapped to Map kinase kinase kinase activity. But signal transduction by protein phosphorylation is way more boring and uninformative than Map kinase kinase kinase activity, because it's such a broad term that includes so many processes. If Map kinase kinase kinase activity is significant, do you really want to ignore that in hopes that signal transduction by protein phosphorylation will remain so?

You should also note that there are two sources of power here. You obviously lose power with increasing comparisons. But you also lose power by dilution of the activity. Using the above example, say there are 20 Map kinase kinase kinase activity genes, and 8 are in your set of significant genes. That will probably end up being significant. But let's also say that there are 60 Map kinase kinase activity genes, and only the 8 Map kinase kinase kinase activity genes are significant. If you just test that term, you may no longer achieve significance because you diluted out the results with the other child terms of Map kinase kinase activity that weren't significant (btw this is all made up. Well, except for the fact that Map kinase kinase kinase activity is a grand child term of Map kinase kinase activity, and an ancestor of signal transduction by protein phosphorylation. But the numbers are fake, and only for illustration purposes).

Anyway, I don't think you are going to necessarily gain power by doing this. You could hypothetically take the GO BP terms from the Broad, cut out just the ones you like, and then do a competitive gene set test with say romer from the limma package.

If you really do want to do this, the code is Open Source and freely available. The mappings are done by a function called getGoToEntrezMap_db that is in the file categoryToEntrezBuilder-methods.R of the Category package. You could clone the sources from github, make whatever changes you like, and then install your forked package. That might be a fun project.

Entering edit mode

Regarding the way GOstats implements the hypergeometric test, I actually wasn't aware that it propagated inherited terms back through the DAG. However, since I'm looking at data from 450K arrays, I decided to implement the "gometh" function from missMethyl since it will take lists of "significant" and "universe" CpGs in lieu of gene lists. It also corrects for the bias introduced due to the variation in the number of CpGs corresponding to a given gene (between 1-1000+). Looking at the primary publication as well as the associated documentation, I can't make out whether it behaves the same way (so I'll need to consult the authors on that).

Also, that is a good point regarding diluting "out the results with the other child terms of Map kinase kinase activity that weren't significant". However, the reason I want to select a few upper level terms is that I have reason to believe (based on the condition we are studying). That genes related to B cell development, activation, etc. will likely be over-represented. However, I really can't predict which of the child terms might contribute, which is why I want to test these upper-level parent terms a priori.

Over the last few hours, I think I might have worked out a way to do this, simply through getting a gene list based on the terms I'd like to selected. I'll post the code when I think I've got it worked out, but would welcome a second look at it.

Thanks very much for the lengthy and well-thought out response!


Login before adding your answer.

Traffic: 229 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6