First, some disclaimers:
1. I don't work with gene expression data, so lack the insights that
2. I maintain the randomForest package, and use it a lot, so count on
Now, if Karen's objective is finding differentially expressed genes, I think
that randomForest is overkill. However, for classification as well as
data exploration, randomForest can be a very handy tool. What we have
found, through both simulated and real (non-genomic) data, is that the
variable importance measures can be very effective. I don't see anything
wrong with using them to identify potentially "interesting" genes.
There are some points to keep in mind, though:
1. We had found "measure 1" of variable importance to be
some situations, and not very stable even with a large number of trees. We
decided to abandon measures 1 and 3. In the next version of the
package, only measures 2 and 4 are computed. Both of these are quite stable
(with, say, 500 or more trees).
2. In most cases that we have seen, randomForest is extremely robust to
noise variables, in the sense that the cross-validated error rates do not
improve significantly as the number of variables is reduced, for data sets
where we know there is a large number of noise variables. While reducing the
number of variables may be a necessity for other classifiers, it doesn't
affect RF much most of the time.
3. Considering #2 above, the value of the importance measures is
mostly for "interpretation" or exploration. There's an obvious limitation,
though: the measures do not give any hints on trends or directions. To gain
further insight into the structure of the data, one should use the information
provided by variable importance and carry out further exploration with other
tools (e.g., fit more "interpretable" models using the most important
variables, but be careful not to read too much into the performance of these
models, as selection bias has crept in).
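
To make the idea of a permutation-style importance measure concrete, here is a
small sketch in Python (standard library only). It is not the randomForest
implementation, and it is not meant to match any of the numbered measures in
the package. As assumptions for illustration only, it uses a nearest-centroid
classifier as a stand-in for a forest, synthetic data with one informative
variable and four pure-noise variables, and function names (centroid_fit,
permutation_importance, make_data) made up for this example.

```python
import random


def centroid_fit(X, y):
    """Return the per-class mean vector (a nearest-centroid 'model')."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def centroid_predict(model, row):
    """Predict the class whose centroid is closest in squared distance."""
    return min(model, key=lambda c: sum((v - m) ** 2 for v, m in zip(row, model[c])))


def accuracy(model, X, y):
    return sum(centroid_predict(model, r) == t for r, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, rng, n_repeats=10):
    """Mean drop in accuracy when one variable's values are shuffled."""
    base = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [r[j] for r in X]
            rng.shuffle(col)  # break the link between variable j and the class
            X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, col)]
            drops.append(base - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def make_data(n, rng, n_noise=4):
    """Synthetic two-class data: variable 0 carries signal, the rest are noise."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        signal = rng.gauss(3.0 * label, 1.0)
        X.append([signal] + [rng.gauss(0.0, 1.0) for _ in range(n_noise)])
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_data(200, rng)
X_test, y_test = make_data(200, rng)  # importance is measured on held-out data
model = centroid_fit(X_train, y_train)
importances = permutation_importance(model, X_test, y_test, rng)
for j, imp in enumerate(importances):
    print(f"variable {j}: mean drop in accuracy = {imp:.3f}")
```

In this toy setup the informative variable should show a large drop in
accuracy and the noise variables a drop near zero. Note that the numbers only
rank variables; they say nothing about the direction of an effect, which is
the limitation mentioned in point 3.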
That's my $0.02 for the day...
> -----Original Message-----
> From: Nicholas Lewin-Koh [mailto:email@example.com]
> Sent: Monday, March 24, 2003 10:52 PM
> To: Karen.Chancellor@asu.edu
> Cc: firstname.lastname@example.org
> Subject: Re:[BioC] feature selection
> Hi Karen,
> I don't know that starting with randomForest and using the importance
> values is the best way to start. I would suggest first filtering the
> data in different ways, for example by the 200 largest F values. If your
> question is to identify differentially expressed genes then you really want a
> multiple comparisons approach. The multcomp package is quite good. If
> the interest is a classification rule, try filtering in different ways
> as suggested above, and then try some exploratory
> discriminant analysis.
> I have gotten good results with the fda function in the mda package on
> CRAN. Use the gen.ridge method option and that gives penalized
> discriminant analysis. This can help to look at the projections and just
> determine if the states are separable. You can also look at the
> coefficients for each variable. After some careful EDA then go for
> Karen writes>
> Hello Bioconductor folk,
> Can any of the Bioconductor packages be used on a .pcl file, rather than
> starting with the raw data?
> I am starting with a .pcl file containing approximately 900 genes and 50
> samples, which I have read using read.table. The classification is known,
> and there are 3 classes of samples. I am interested in reducing the number
> of genes. I would like to use the R randomForest package for this task.
> Is this appropriate? I'm new to this so will appreciate any help.
> Bioconductor mailing list