The examples in the GOexpress vignette run correctly for me, and I have now run various subsets of my own data. The output looks generally sensible: the identified GO terms are similar to those I find by other methods (e.g. limma differential expression followed by DAVID functional enrichment). However, I'm a bit worried because, with my own data, the values printed at the randomForest step are usually all or mostly zero, as shown here:
GOexpress_results <- GO_analyse(eSet = eset4GO, f = "clonal_limma", GO_genes = GOgenes.Ensembl, all_GO = allGO.Ensembl, all_genes = allgenes.Ensembl)
Using custom GO_genes mapping ...
9623 features from ExpressionSet found in the mapping table.
Using custom GO terms description ...
Analysis using method randomForest on factor clonal_limma for 10671 genes. This may take a few minutes ...
ntree      OOB      1      2
  100:   0.00%  0.00%  0.00%
  200:   0.00%  0.00%  0.00%
  300:   0.00%  0.00%  0.00%
  400:   0.00%  0.00%  0.00%
  500:   0.00%  0.00%  0.00%
  600:   0.00%  0.00%  0.00%
  700:   0.00%  0.00%  0.00%
  800:   0.00%  0.00%  0.00%
  900:   0.00%  0.00%  0.00%
 1000:   0.00%  0.00%  0.00%
Using custom gene descriptions ...
Merging score into result table ...
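To make clear how I am reading that trace: as far as I can tell, it has the same layout as the `do.trace` output of the randomForest package, i.e. the out-of-bag (OOB) error followed by one error column per level of the factor. The following is only a minimal sketch on simulated data, assuming the randomForest package is installed. It is not my real data and not necessarily how GOexpress calls randomForest internally, and the object names (`x`, `group`, `rf`) are made up for illustration.

```r
library(randomForest)

set.seed(1)
n_samples <- 20
n_genes   <- 200

# Hypothetical two-level factor, standing in for "clonal_limma"
group <- factor(rep(c("1", "2"), each = n_samples / 2))

# Simulated expression matrix: samples in rows, genes in columns
x <- matrix(rnorm(n_samples * n_genes), nrow = n_samples,
            dimnames = list(NULL, paste0("gene", seq_len(n_genes))))

# Shift the first 20 genes strongly in group "2" so the classes separate
x[group == "2", 1:20] <- x[group == "2", 1:20] + 3

# do.trace = 100 prints a progress line every 100 trees
rf <- randomForest(x = x, y = group, ntree = 1000,
                   importance = TRUE, do.trace = 100)

# Same numbers as the trace, stored per tree: columns OOB, "1", "2"
tail(rf$err.rate, 1)
```

In this sketch the classes are separated on purpose, so I would expect the OOB and per-class columns to be at or close to zero, which is the pattern I am asking about.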
Is this something to do with overfitting? Does it mean that my downstream results are suspect?
Thanks for any help (I can give more of my input/output if needed - just tell me what to provide).
Great - thanks.