University of Oxford

Dear Martin,

*"I'm having difficulty understanding what the returned pvalues because there's no mention what what the contrasts are. [...] It appears that the pvalue returned is for the overall effect a GO term across all factors (e.g., sort of like an F statistic)."*

Indeed, you are right: GOexpress was designed to estimate how well GO terms discriminate *all* levels of the factor analysed. As Gordon pointed out, there are already various methods designed to address pairwise contrasts.

Now, regarding the actual meaning of the P-value, I assume you have already had a look at the vignette, section "Permutation-based P-value for ontologies", which I quote below:

*"To assess the significance of GO term ranking — or scoring —, we implemented a permutation-based function randomising the gene feature ranking, and counting how many times each GO term is ranked (scored) equal or higher than the real rank (score)."*

In other words, the P-values do not estimate *enrichment* of GO terms within a gene list (the random forest returns the full list of genes). Rather, the P-values estimate how often each GO term would, by chance, rank/score at least as high as it did in the ranking/scoring of genes obtained from the random forest (RF).

More concretely, P-values are obtained as follows:

- the full list of genes is randomised N times;
- for each randomisation, the rank/score of each GO term is recalculated and compared to the rank/score obtained from the RF;
- P-value(GOterm) = {number of randomisations in which GOterm ranked/scored equal to or better than in the RF} / {N}
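To make the steps above concrete, here is a minimal Python sketch of the same idea. It is not the GOexpress code (which is written in R), and it rests on assumptions of mine: the function name `permutation_p_values`, the toy gene scores, the made-up term labels `GO:top` and `GO:bottom`, and the choice to score a term by the mean score of its genes are all purely illustrative.

```python
import random


def permutation_p_values(gene_scores, go_to_genes, n_perm=1000, seed=0):
    """Illustrative permutation P-values for GO terms (hypothetical helper).

    gene_scores: dict mapping gene -> importance score (e.g. from a random forest)
    go_to_genes: dict mapping GO term -> list of annotated genes
    Assumption: a term's score is the mean score of its annotated genes.
    """
    rng = random.Random(seed)
    genes = list(gene_scores)
    scores = [gene_scores[g] for g in genes]

    def term_scores(score_by_gene):
        # Mean gene score per GO term.
        return {
            term: sum(score_by_gene[g] for g in members) / len(members)
            for term, members in go_to_genes.items()
        }

    observed = term_scores(gene_scores)
    counts = dict.fromkeys(go_to_genes, 0)

    for _ in range(n_perm):
        # Randomise which gene carries which score.
        shuffled = scores[:]
        rng.shuffle(shuffled)
        permuted = term_scores(dict(zip(genes, shuffled)))
        for term in go_to_genes:
            # Count randomisations scoring equal to or higher than observed.
            if permuted[term] >= observed[term]:
                counts[term] += 1

    return {term: counts[term] / n_perm for term in go_to_genes}


# Toy data: two genes with high scores, four with low scores.
gene_scores = {"g1": 9.0, "g2": 8.0, "g3": 1.0, "g4": 0.5, "g5": 0.2, "g6": 0.1}
go_to_genes = {"GO:top": ["g1", "g2"], "GO:bottom": ["g5", "g6"]}
p_values = permutation_p_values(gene_scores, go_to_genes)
```

In this toy example, `GO:top` holds the two best-scoring genes, so very few randomisations match its observed score and its P-value is small. `GO:bottom` holds the two worst-scoring genes, so every randomisation scores at least as high and its P-value is 1. Note that nothing here refers to the sample groups or to the direction of any difference between them.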

As a consequence, to answer your other question, this is indeed equivalent to a *one-tailed test*: not for *enrichment of genes in a list*, but for *significance of the rank/score* relative to a randomised gene list.

A corollary of the above explanation is that terms are not contrasted in any direction between the groups of samples. Rather, each term is tested for whether its associated genes rank/score higher in the RF than expected by chance, where the RF's task is to classify the samples into their known groups.

Given your interest in pairwise contrasts, I think Gordon's suggestion is better suited. But for what it is worth, your use of subset() would have been exactly my suggestion for an alternative approach using GOexpress. Actually, I would be quite interested if you were keen to run both and share your impression of the comparison.

All the best,

Kevin