Question

CAMERA + Gene Ontology Gene Sets

Tyler Sagendorf ▴ 10

In this reply, it is stated that Correlation Adjusted MEan RAnk gene set testing (CAMERA) is not intended for use with Gene Ontology (GO) gene sets due to their "redundancy and lack of directionality", though I could not find any other explicit mention of this. I imagine it is something to do with the inner workings of CAMERA, though the exact reason is unfortunately lost on me.

After restricting the GO sets to only those genes quantified in the experimental data and applying reasonable set size filters [10, G - 1], suppose the redundancy could be mostly addressed by using a hierarchical clustering approach similar to what is described in the MSigDB v7.0 Release Notes (sections 3.2 and 3.7). In that case, would the lack of directionality of the GO still be enough of a problem to warrant a different competitive test? Would limma::geneSetTest be more appropriate, despite not accounting for inter-gene correlation?

Note: I am aware that, in the RNA-Seq Analysis is Easy as 1, 2, 3 vignette and the Gene-expression data integration to squamous cell lung cancer subtypes reveals drug sensitivity publication, CAMERA is used in conjunction with sets from the C2 collection of the Molecular Signatures Database (MSigDB), where a number of terms (though perhaps not all) indicate directionality with "_UP" or "_DN" suffixes.

GO limma CAMERA

updated 5 months ago by Gordon Smyth 51k • written 5 months ago by Tyler Sagendorf ▴ 10

score 1 · Answer 1 · 2024-02-10

GO is a very large collection of gene annotation terms. It doesn't work terribly well with competitive gene set tests because it contains many very broad terms and because GO terms are mostly non-directional. The GO term for a particular biological process will typically contain all genes loosely associated with that process, including inhibitors as well as promoters of the process. So a GO term might correspond to a highly relevant biological process but still not be strongly up-regulated or down-regulated in the DE results.
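To make the directionality point concrete, here is a minimal Python sketch of how mixed-sign statistics cancel when a set's mean is compared with the mean of all other genes. The gene names, the t-statistics, and the helper `set_mean_shift` are all made up for illustration. It shows only the cancellation effect and is not CAMERA itself, which additionally adjusts the test for inter-gene correlation.

```python
from statistics import mean

def set_mean_shift(t_stats, gene_set):
    """Mean statistic of genes in the set minus mean statistic of all other genes."""
    in_set = [t for g, t in t_stats.items() if g in gene_set]
    rest = [t for g, t in t_stats.items() if g not in gene_set]
    return mean(in_set) - mean(rest)

# Hypothetical genewise t-statistics (invented numbers, not real data).
t_stats = {
    "act1": 4.0, "act2": 3.0,      # promoters of the process, up-regulated
    "inh1": -4.0, "inh2": -3.0,    # inhibitors of the same process, down-regulated
    "bg1": 0.5, "bg2": -0.5, "bg3": 0.0, "bg4": 0.0,  # unrelated background genes
}

go_like_set = {"act1", "act2", "inh1", "inh2"}  # non-directional: both signs present
directional_set = {"act1", "act2"}              # "_UP"-style set: one sign only

mixed_shift = set_mean_shift(t_stats, go_like_set)
directional_shift = set_mean_shift(t_stats, directional_set)

print(f"GO-like set shift:     {mixed_shift:.3f}")
print(f"Directional set shift: {directional_shift:.3f}")
```

Every gene in the GO-like set is strongly differentially expressed, yet the set-level shift is zero because the up and down changes cancel. The directional subset shows a clear shift.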

The GO collection is also hierarchical: the gene set for each GO term is a subset of the gene sets of its parent terms, so many sets overlap heavily.

There is no mathematical or statistical reason why you can't run CAMERA on GO gene sets, but the nature of the GO collection means that (i) statistical power might be reduced and (ii) the interpretation of the results might not be clear. These are scientific issues rather than mathematical issues, and they affect all competitive gene set tests such as CAMERA, geneSetTest, GSEA etc. It is not a hidden issue to do with the inner workings of CAMERA!

The MSigDB collection prunes the GO term collection to remove gene sets that are too large or too small and to reduce redundancy. If you want to run CAMERA on GO terms, then it would indeed be a good idea to use the curated GO collection from the MSigDB. The limma team provides the MSigDB collection in an R-friendly format ready to be input into limma and CAMERA, see:

https://bioinf.wehi.edu.au/MSigDB/

Beware though that our mouse version of MSigDB recreates the GO gene sets from scratch using mouse annotation without the same sort of set pruning as done by the Broad Institute. Maybe we should revisit this, but I thought it better to allow users to do their own pruning.
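For anyone doing their own pruning, one simple possibility is a greedy filter on Jaccard overlap, sketched below in Python. This is a deliberately simpler stand-in for the hierarchical clustering mentioned in the question, not the Broad Institute's procedure. The function names, the example sets, and the 0.85 cutoff are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def prune_redundant(gene_sets, max_jaccard=0.85):
    """Greedy pruning: visit sets from largest to smallest and keep a set only
    if its Jaccard overlap with every already-kept set is at most max_jaccard.
    The 0.85 default is an illustrative assumption, not a recommended value."""
    kept = {}
    for name in sorted(gene_sets, key=lambda n: (-len(gene_sets[n]), n)):
        genes = gene_sets[name]
        if all(jaccard(genes, other) <= max_jaccard for other in kept.values()):
            kept[name] = genes
    return kept

# Hypothetical sets: "child" shares 9 of its parent's 10 genes (Jaccard 0.9).
gene_sets = {
    "parent": {f"g{i}" for i in range(1, 11)},
    "child": {f"g{i}" for i in range(1, 10)},
    "other": {f"g{i}" for i in range(20, 26)},
}

pruned = prune_redundant(gene_sets)
print(sorted(pruned))
```

Here the near-duplicate child term is dropped in favour of its parent. Whether to keep the larger or the smaller of two overlapping sets is a design choice; keeping the smaller one would favour more specific terms.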

Alternatively, the set size filter that you mention would also be a big help and would substantially mitigate the problems.
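As a rough Python sketch of that filter, each set is first restricted to the genes quantified in the experiment and then kept only if its size falls in [10, G - 1]. G is taken here to be the number of quantified genes, which is an assumption about the notation in the question. The function name and the toy data are hypothetical.

```python
def filter_gene_sets(gene_sets, quantified_genes, min_size=10):
    """Restrict each set to quantified genes, then keep sets whose restricted
    size lies in [min_size, G - 1], where G is the number of quantified genes."""
    universe = set(quantified_genes)
    max_size = len(universe) - 1
    filtered = {}
    for name, genes in gene_sets.items():
        present = set(genes) & universe
        if min_size <= len(present) <= max_size:
            filtered[name] = present
    return filtered

# Hypothetical experiment with G = 50 quantified genes.
quantified = [f"g{i}" for i in range(50)]

gene_sets = {
    "very_broad": set(quantified) | {"x1", "x2"},        # covers every quantified gene
    "mid_sized": {f"g{i}" for i in range(20)} | {"x1"},  # 20 quantified, 1 not quantified
    "tiny": {f"g{i}" for i in range(5)},                 # only 5 genes
}

filtered = filter_gene_sets(gene_sets, quantified)
print({name: len(genes) for name, genes in filtered.items()})
```

The upper bound G - 1 removes sets that contain every quantified gene, for which a competitive test has no "other genes" to compare against.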

In my own work, I have tended to test GO terms using overlap tests, i.e., using goana and kegga rather than camera, largely because the simple overlap tests aren't affected by the GO term redundancies. Both approaches have their advantages. If we went to the trouble of pruning the GO gene set collection for mouse, then it might well make sense for us to use CAMERA more often with GO.
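For comparison, the idea behind a simple overlap test can be sketched in Python as a one-sided hypergeometric tail probability. This shows the general technique only and is not goana's or kegga's implementation. The function name and the counts are hypothetical.

```python
from math import comb

def overlap_p_value(n_universe, n_set, n_de, n_overlap):
    """One-sided hypergeometric p-value: probability of seeing at least
    n_overlap set members among n_de DE genes drawn from n_universe genes,
    of which n_set belong to the gene set."""
    upper = min(n_set, n_de)
    total = comb(n_universe, n_de)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20 genes in total, 5 in the set, 5 called DE.
p_all_overlap = overlap_p_value(20, 5, 5, 5)  # every DE gene is in the set
p_no_overlap = overlap_p_value(20, 5, 5, 0)   # tail starting at zero covers all outcomes

print(p_all_overlap, p_no_overlap)
```

Each term is tested against the DE gene list on its own, and the test uses only set membership rather than the sign or size of the expression changes.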