Is it common to get all 0 values when running the randomForest step in GOexpress?
willj ▴ 30

The examples in the GOexpress vignette run correctly for me, and I've now run various subsets of my own data. The output looks generally sensible: the identified GO terms are similar to those I find by other methods (e.g. limma differential expression followed by DAVID functional enrichment). However, I'm a bit worried because, with my own data, the values output at the randomForest step are usually all or mostly zeroes, as shown here:

GOexpress_results <- GO_analyse(
   eSet = eset4GO, f = "clonal_limma",
   GO_genes=GOgenes.Ensembl,
   all_GO=allGO.Ensembl,
   all_genes=allgenes.Ensembl
)
Using custom GO_genes mapping ...
9623 features from ExpressionSet found in the mapping table.
Using custom GO terms description ...
Analysis using method randomForest on factor clonal_limma for 10671 genes. This may take a few minutes ...
ntree      OOB      1      2
  100:   0.00%  0.00%  0.00%
  200:   0.00%  0.00%  0.00%
  300:   0.00%  0.00%  0.00%
  400:   0.00%  0.00%  0.00%
  500:   0.00%  0.00%  0.00%
  600:   0.00%  0.00%  0.00%
  700:   0.00%  0.00%  0.00%
  800:   0.00%  0.00%  0.00%
  900:   0.00%  0.00%  0.00%
 1000:   0.00%  0.00%  0.00%
Using custom gene descriptions ...
Merging score into result table ...

Is this something to do with overfitting? Does it mean that my downstream results are suspect?

Thanks for any help (I can give more of my input/output if needed - just tell me what to provide).

kevin.rue ▴ 350

Dear willj,

Having 0s at the random forest step is actually a very good sign for your data set. It means gene expression levels are very good at classifying your experimental groups.

These zeros are the "out-of-bag" (OOB) error rates: the proportion of misclassified samples, computed across all the random classification trees generated so far, with each sample classified only by the trees that did not see it during training. To be completely clear, the "OOB" column is the proportion of misclassified samples across all groups combined, and the subsequent columns are the proportion of misclassified samples within each group. (For a data set of two groups of 10 samples each, the OOB is a proportion out of 20, and the next two columns are proportions out of 10.)
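To make the "out of 20" versus "out of 10" point concrete, here is a small base R sketch. The labels and the three misclassified samples are invented for illustration; in a real forest the prediction for each sample would come from the trees that did not use it.

```r
# Hypothetical OOB predictions for 20 samples: two groups of 10.
truth <- factor(rep(c("1", "2"), each = 10))
predicted <- truth
predicted[3] <- "2"           # one group-1 sample misclassified
predicted[c(12, 17)] <- "1"   # two group-2 samples misclassified

wrong <- predicted != truth

# "OOB" column: misclassified samples out of all 20.
oob_overall <- mean(wrong)

# Per-group columns: misclassified samples out of the 10 in each group.
oob_by_group <- tapply(wrong, truth, mean)

print(oob_overall)
print(oob_by_group)
```

With unequal group sizes the "OOB" value would no longer be the simple mean of the per-group columns, since it is computed over all samples pooled together.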

The lower the OOB proportions, the better the classification, as you have fewer misclassified samples!
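If you want to see the same trace outside GOexpress, the following is a minimal sketch using the CRAN randomForest package directly on simulated data. The matrix, group labels, and the size of the shift are all made up; only `ntree = 1000` and the trace interval of 100 are chosen to mirror the output in the question.

```r
library(randomForest)  # CRAN package; assumed to be installed

set.seed(1)
n_per_group <- 10
n_genes <- 50

# Simulated expression matrix: samples in rows, genes in columns.
x <- matrix(rnorm(2 * n_per_group * n_genes), ncol = n_genes)
colnames(x) <- paste0("gene", seq_len(n_genes))
group <- factor(rep(c("1", "2"), each = n_per_group))

# Shift the first five genes strongly in group 2 so the groups separate well.
x[group == "2", 1:5] <- x[group == "2", 1:5] + 4

# do.trace = 100 prints the OOB and per-class error every 100 trees.
rf <- randomForest(x = x, y = group, ntree = 1000, do.trace = 100)

# The same numbers are stored per tree: columns "OOB", "1" and "2".
print(tail(rf$err.rate, 1))

# OOB confusion matrix, with a per-class error column.
print(rf$confusion)
```

When a handful of genes separate the groups this clearly, error rates at or near zero are what one would expect to see; removing the shift should push them towards chance level.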

Over-fitting is always a risk in machine learning, but I like the random forest for its sub-sampling of genes. In iterations where the genes with the strongest effect are not sampled, they are excluded from the classification, so genes with weaker effects are also explored for their capacity to classify samples.
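As a rough illustration of how often a given gene is left out, here is a back-of-the-envelope calculation. It assumes the randomForest package's default for classification, `floor(sqrt(p))` candidate variables per split; GOexpress may set a different value, so treat the number as indicative only.

```r
p <- 10671              # genes analysed, taken from the log in the question
mtry <- floor(sqrt(p))  # ASSUMPTION: randomForest's classification default

# Candidates are drawn without replacement at each split, so the chance that
# one particular gene is NOT among them is 1 - mtry / p.
p_excluded <- 1 - mtry / p

print(mtry)
print(p_excluded)
```

Under that assumption any single gene, however strong its effect, is absent from the large majority of splits, which is what gives weaker genes a chance to contribute.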

I hope that helps :)

Kévin


Great - thanks.
