I want to perform an overrepresentation for GO terms for a particular microarray experiment. Most of the methods to do this in R (see the "gometh" function included in the missMethyl package or "hyperGTest" in the GOstats package) only consider a test for enrichment of terms directly annotated by a given gene. I would like to be able to customize the level of abstraction of my analysis such that I can ask questions about higher level GO terms, which I'm hoping will increase the power of my analysis for specific questions by reducing the amount of multiple hypothesis testing correction that I must implement. Does anyone have insight on the best way to approach this?
The tests consider all direct and child terms (actually any terms that are 'lower' in the directed acyclic graph (DAG) than the given term), not just the direct terms.You can condition out significant lower terms for
hyperGTest, as well as the functions in
topGO, but there isn't any facility that I know of to do arbitrary GO term assignment.
In general, people want the lowest term in the DAG, so long as there are a reasonable number of genes for that term, as the lower you go in the DAG, the more specific the term, and hopefully the more interpretable it will be. Anyway, as an example, let's say you have a gene that is annotated to Map kinase kinase kinase activity. When you test signal transduction by protein phosphorylation, unless you condition (and that gene gets 'used up' in a lower term), the set of signal transduction by protein phosphorylation genes will include the gene that is mapped to Map kinase kinase kinase activity. But signal transduction by protein phosphorylation is way more boring and uninformative than Map kinase kinase kinase activity, because it's such a broad term that includes so many processes. If Map kinase kinase kinase activity is significant, do you really want to ignore that in hopes that signal transduction by protein phosphorylation will remain so?
You should also note that there are two sources of power here. You obviously lose power with increasing comparisons. But you also lose power by dilution of the activity. Using the above example, say there are 20 Map kinase kinase kinase activity genes, and 8 are in your set of significant genes. That will probably end up being significant. But let's also say that there are 60 Map kinase kinase activity genes, and only the 8 Map kinase kinase kinase activity genes are significant. If you just test that term, you may no longer achieve significance because you diluted out the results with the other child terms of Map kinase kinase activity that weren't significant (btw this is all made up. Well, except for the fact that Map kinase kinase kinase activity is a grand child term of Map kinase kinase activity, and an ancestor of signal transduction by protein phosphorylation. But the numbers are fake, and only for illustration purposes).
Anyway, I don't think you are going to necessarily gain power by doing this. You could hypothetically take the GO BP terms from the Broad, cut out just the ones you like, and then do a competitive gene set test with say
romer from the limma package.
If you really do want to do this, the code is Open Source and freely available. The mappings are done by a function called getGoToEntrezMap_db that is in the file categoryToEntrezBuilder-methods.R of the Category package. You could clone the sources from github, make whatever changes you like, and then install your forked package. That might be a fun project.