Question: RandomForest, supervised machine learning and uncertainty
8.7 years ago, January Weiner wrote:
Dear all,

I am using RandomForests for supervised machine learning. My set of biomarkers is quite good at distinguishing the samples from different classes.

However, I would get an even better classification if I could introduce a class of "Unknown" or "Unclassified" samples. Given that alrf is the RF object

    alrf <- randomForest( group ~ ., data=all )

I take a look at the matrix alrf$votes. I notice that in almost all the misclassified cases, the votes were close to a tie; there were also some correctly classified cases close to a tie.

If I define an additional group called "Undefined", this group will be larger than the group of misclassified cases (as some correctly classified cases will also go into that class). However, the error rate *outside* of the class will be almost negligible. From a purely pragmatic point of view in biomarker discovery such a situation is preferable: it's better to admit that you don't know something than to risk a misclassification.

And here is my question:

Is there a standard method of creating such a class? For example, for a given sample i, I use

    sum( ( votes[i,] - max( votes[i,] ) )^2 )

or the difference between the two top votes for a given sample. But I think that this approach is not sufficient.

Best regards,

j.

--
-------- Dr. January Weiner 3 --------------------------------------
Max Planck Institute for Infection Biology
Charitéplatz 1
D-10117 Berlin, Germany
Web : www.mpiib-berlin.mpg.de
Tel : +49-30-28460514

Answer: RandomForest, supervised machine learning and uncertainty
8.7 years ago, Vincent J. Carey, Jr. (United States) wrote:

On Wed, Dec 8, 2010 at 5:43 AM, January Weiner <january.weiner at mpiib-berlin.mpg.de> wrote:
> Is there a standard method of creating such a class? For example, for
> a given sample i, I use sum( ( votes[i,] - max( votes[i,] ) )^2 ) or
> the difference between the two top votes for a given sample. But I
> think that this approach is not sufficient.

I don't think there is anything like a "standard method" for this task, but if I read you correctly you are addressing the extension of the decision task from two classes to two classes plus "doubt". This is discussed at some length in Ripley's "Pattern Recognition and Neural Networks" book; see the comments on the "error-reject" curve on p. 20 and on the "safety threshold" concept on p. 22.

The MLInterfaces vignette has an application (that, as written, turns out to be nugatory) just at the end of the vignette: the doubt interval is too narrow to capture any classification for the data in use. If you change the code to

    douPred[smallDou(0.35, 0.65)] <- "doubt"

one prediction is converted to "doubt". This issue deserves more attention.
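
To make the trade-off discussed above concrete, here is a minimal base-R sketch of a margin-based "doubt" rule together with the reject rate and error rate it produces at different thresholds. It is only an illustration under stated assumptions, not the MLInterfaces code or a standard method: the vote matrix and true labels are made up, and the names vote_margin, predict_with_doubt, error_reject and min_margin are hypothetical helpers introduced here. With a fitted forest one would presumably pass alrf$votes in place of the toy matrix.

```r
# Hypothetical vote matrix in the shape of alrf$votes: one row per sample,
# one column per class, each row holding vote fractions that sum to 1.
votes <- matrix(
  c(0.90, 0.10,
    0.55, 0.45,
    0.20, 0.80,
    0.48, 0.52),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("s", 1:4), c("A", "B"))
)
# Made-up true labels for the four samples.
truth <- factor(c("A", "B", "B", "B"), levels = c("A", "B"))

# Margin = top vote fraction minus runner-up fraction (small margin = near tie).
vote_margin <- function(votes) {
  apply(votes, 1, function(v) {
    s <- sort(v, decreasing = TRUE)
    s[1] - s[2]
  })
}

# Assign the majority class, or "doubt" when the margin is below min_margin.
predict_with_doubt <- function(votes, min_margin) {
  pred <- colnames(votes)[max.col(votes, ties.method = "first")]
  pred[vote_margin(votes) < min_margin] <- "doubt"
  pred
}

# For each threshold: fraction of samples rejected as "doubt", and the
# error rate among the samples that were still classified.
error_reject <- function(votes, truth, thresholds) {
  t(sapply(thresholds, function(th) {
    pred <- predict_with_doubt(votes, th)
    kept <- pred != "doubt"
    c(threshold = th,
      reject = mean(!kept),
      error = if (any(kept)) mean(pred[kept] != as.character(truth[kept])) else NA)
  }))
}

pred <- predict_with_doubt(votes, min_margin = 0.2)
curve <- error_reject(votes, truth, thresholds = c(0, 0.2))
print(pred)
print(curve)
```

Scanning a grid of thresholds and plotting error against reject gives an empirical version of the error-reject trade-off; the threshold can then be chosen from the error rate one is willing to accept outside the "doubt" class. If the threshold is tuned on the same votes used to report the error, the reported error will likely be optimistic.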