Search
Question: Contrasts in GOexpress analysis?
0
21 months ago by
mforde8410
mforde8410 wrote:

Hi,

I'm a little confused by what's actually going on with the analysis for this package.

The main function can take a factor of any size and returns enriched GO terms and pval based upon those factors using random forest regressions.

I'm having difficulty understanding what the returned pvalues because there's no mention what what the contrasts are. For example, say I ran the analysis with a treatment factor having three independent levels. It appears that the pvalue returned is for the overall effect a GO term across all factors (e.g., sort of like an F statistic). I'd like to be able to do contrasts between groups, e.g., grps1 - 2, grps1 - 3, and grps2 - 3.  I was able to use subEset() function to run a 2 factor analysis (e.g., subEset(eSet, subset=list(Treatment=c("grp1", "grp2"))). I'm assuming this is a one tail test (e.g., only significant term enrichment and doesn't include depletion) I wasn't sure enriched terms are being contrasted grp1-2 or grp2-1, or if it's a combination of both.

Any one have any thoughts on what these pvalues mean or suggestions of R packages which might be better suited for the contrasts I want to analyze?

Martin

modified 21 months ago by kevin.rue220 • written 21 months ago by mforde8410
3
21 months ago by
kevin.rue220
University of Oxford
kevin.rue220 wrote:

Dear Martin,

"I'm having difficulty understanding what the returned pvalues because there's no mention what what the contrasts are. [...] It appears that the pvalue returned is for the overall effect a GO term across all factors (e.g., sort of like an F statistic)."

Indeed you are right: GOexpress was designed to estimate how well GO discriminate all levels of the factor analysed. As Gordon pointed out, there are already various methods designed to address pairwise contrasts.

Now, regarding the actual meaning of the P-value, I assume you already had a look at the vignette, section "Permutation-based P-value for ontologies", which I reference below:

"To assess the significance of GO term ranking — or scoring —, we implemented a permutation-based function randomising the gene feature ranking, and counting how many times each GO term is ranked (scored) equal or higher than the real rank (score)."

In other words, the P-values do not estimate enrichment of GO terms within a gene list (the random forest return the full list of genes). Rather P-values estimate how often each GO term would rank/score higher by chance, than they did in the ranking/scoring of genes obtained from the random forest (RF).
More clearly, P-values are obtained as follows:

• the full list of genes is randomised N times,
• for each randomisation, the rank/score of each GO term is calculated and compared to rank/score obtained by the RF
• P-value(GOterm) = {count of randomisation where GOterm ranked/scored better than in RF} / {N}

As a consequence, to answer your other question, this is indeed equivalent to a one-tailed test. However, not for enrichment of genes in a list but rather for significance of the rank/score relative to a random gene list.
A correlate of the above explanation is that terms are not contrasted in any direction between the groups of samples, rather they are tested for the frequency at which their associated genes rank/score higher than expected by chance in the RF in the task of classifying the samples into their known groups.

Given your interest in pairwise contrasts, I think Gordon's suggestion is better suited. But for what it is worth, your use of subset() would have been exactly my suggestion for an alternative approach using GOexpress. Actually, I would be quite interested if you were keen to run both and share your impression of the comparison.

All the best,
Kevin

Hi Kevin,

Thanks for the clarifications. I'll run some test tomorrow, and do some comparisons and see what looks suitable. Again, thanks for the nice response.

Martin

2
21 months ago by
Gordon Smyth35k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth35k wrote:

Do you require random forest regression? Since you're already using the limma package ( limma -- posthoc analysis when design factor has 3 levels ), have you tried the goana function for a more traditional GO analysis?

I'll continue the code that Aaron gave you in response to your previous post. If your data is human and has Entrez Gene IDs as row names, you might proceed:

group <- factor(rep(1:3, c(nGrp1, nGrp2, nGrp3))
design <- model.matrix(~0 + group)
con <- makeContrasts(
G1vs2 = group1 - group2,
G1vs3 = group1 - group3,
G3vs2 = group3 - group2, levels=design)
fit <- lmFit(y, design)
​fit2 <- contrasts.fit(fit, con)
fit2 <- eBayes(fit2)
g <- goana(fit2, coef="G1vs2", species="Hs")
topGO(g)
g <- goana(fit2, coef="G1vs3", species="Hs")
topGO(g)

etc.