Question

Is it common to get all 0 values when running the randomForest step in GOexpress?

willj ▴ 30

@willj-8763

France

The examples in the GOexpress vignette run correctly for me, and I've now run various subsets of my own data. The output looks generally sensible: the identified GO terms are similar to those I find by other methods (e.g. limma differential expression followed by DAVID functional enrichment). However, I'm a bit worried because, with my own data, the values output at the randomForest step are usually all or mostly zeroes, as shown here:

GOexpress_results <- GO_analyse(
   eSet = eset4GO, f = "clonal_limma",
   GO_genes=GOgenes.Ensembl,
   all_GO=allGO.Ensembl,
   all_genes=allgenes.Ensembl
)

Using custom GO_genes mapping ...
9623 features from ExpressionSet found in the mapping table.
Using custom GO terms description ...
Analysis using method randomForest on factor clonal_limma for 10671 genes. This may take a few minutes ...
ntree      OOB      1      2
  100:   0.00%  0.00%  0.00%
  200:   0.00%  0.00%  0.00%
  300:   0.00%  0.00%  0.00%
  400:   0.00%  0.00%  0.00%
  500:   0.00%  0.00%  0.00%
  600:   0.00%  0.00%  0.00%
  700:   0.00%  0.00%  0.00%
  800:   0.00%  0.00%  0.00%
  900:   0.00%  0.00%  0.00%
 1000:   0.00%  0.00%  0.00%
Using custom gene descriptions ...
Merging score into result table ...

Is this something to do with overfitting? Does it mean that my downstream results are suspect?

Thanks for any help (I can give more of my input/output if needed - just tell me what to provide).

goexpress • 1.5k views

updated 9.1 years ago by kevin.rue ▴ 350 • written 9.1 years ago by willj ▴ 30

score 3 · Accepted Answer · 2015-10-21

Dear willj,

Having 0s at the random forest step is actually a very good sign for your data set. It means gene expression levels are very good at classifying your experimental groups.

These zeros are the "out-of-bag" (OOB) error rates, in other words, the proportion of misclassified samples, computed across all the random classification trees generated so far. Each sample is only classified by the trees that did not use it for training (the trees for which it was "out of bag"). To be completely clear, the column "OOB" is the overall proportion of misclassified samples across all groups, and the subsequent columns are the proportions of misclassified samples within each group. (For a data set of two groups of 10 samples each, the OOB is the proportion out of 20, and the next two columns are the proportions out of 10.)

The lower the OOB proportions, the better the classification, as you have fewer misclassified samples!
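To make the two-group example above concrete, here is a minimal base R sketch of how the three columns relate. The labels and the out-of-bag predictions are made up for illustration (they are not from willj's data, and this is not GOexpress code): one sample of group 1 and two samples of group 2 are assumed to be misclassified.

```r
# Hypothetical data: two groups of 10 samples each, labelled "1" and "2".
truth <- factor(rep(c("1", "2"), each = 10))

# Hypothetical out-of-bag predictions: one sample of group 1 and
# two samples of group 2 are assigned to the wrong group.
oob_pred <- truth
oob_pred[3] <- "2"
oob_pred[c(12, 17)] <- "1"

misclassified <- oob_pred != truth

# "OOB" column: misclassified samples out of all 20.
oob_overall <- mean(misclassified)

# Per-group columns: misclassified samples out of the 10 in each group.
oob_by_group <- tapply(misclassified, truth, mean)

# Print as percentages, in the same column order as the trace.
print(round(100 * c(OOB = oob_overall, oob_by_group), 2))
```

With these made-up predictions the overall value is 3 out of 20 (15%), while the per-group values are 1 out of 10 (10%) and 2 out of 10 (20%). A row of 0.00% therefore means that no sample in either group was misclassified.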

Over-fitting is always a risk in machine learning, but I do like the random forest for its sub-sampling of genes: in iterations where the genes with the strongest effect are not sampled, they are excluded from the classification, and genes with weaker effects are also explored for their capacity to classify samples.
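As a rough illustration of how often a given gene is left out by this sub-sampling, the sketch below computes the chance that one particular gene is offered as a candidate at a single split. It assumes that `mtry` genes are drawn without replacement from all genes at each split, and it uses `floor(sqrt(n_genes))`, which is the default of the randomForest package for classification. The value GOexpress actually passes may differ, so treat `mtry` here as an assumption, and `p_candidate` as a hypothetical helper written for this example.

```r
# Hypothetical helper: chance that one particular gene is among the
# mtry candidates drawn (without replacement) from n_genes at one split.
p_candidate <- function(n_genes, mtry) mtry / n_genes

n_genes <- 10671               # number of genes reported in the run above
mtry <- floor(sqrt(n_genes))   # randomForest classification default (assumed here)

p_in <- p_candidate(n_genes, mtry)   # gene is a candidate at this split
p_out <- 1 - p_in                    # gene is left out at this split

print(c(mtry = mtry, p_in = p_in, p_out = p_out))
```

Under these assumptions a given gene is a candidate at only about 1% of splits, so even the strongest gene is absent from most of them, which leaves room for weaker genes to be evaluated.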

I hope that helps :)

Kévin