Best options for cross validation machine learning
Daniel Brewer ★ 1.9k
@daniel-brewer-1791
Last seen 9.6 years ago
Hello,

I have a microarray dataset on which I have performed an unsupervised Bayesian clustering algorithm that divides the samples into four groups. What I would like to do is:

1) Pick a group of genes that best predicts which group a sample belongs to.
2) Determine how stable these prediction sets are through some sort of cross-validation (I would prefer not to divide my set into a training and test set for stage one).

These steps fall into the supervised machine learning realm, which I am not familiar with, and googling around, the options seem endless. I was wondering whether anyone could suggest reasonable, well-established algorithms to use for both steps.

Many thanks

Dan

--
Daniel Brewer, Ph.D.
Institute of Cancer Research
Molecular Carcinogenesis
Email: daniel.brewer at icr.ac.uk
Microarray Clustering Cancer • 1.5k views
@sean-davis-490
Last seen 3 months ago
United States
On Tue, Jan 19, 2010 at 11:11 AM, Daniel Brewer <daniel.brewer at icr.ac.uk> wrote:

Hi, Dan.

> 1) Pick a group of genes that best predict which group a sample belongs to.

Feature selection....

> 2) Determine how stable these prediction sets are through some sort of
> cross-validation (I would prefer not to divide my set into a training
> and test set for stage one)

Cross-validation.... Note that for cross-validation, steps 1 and 2 necessarily need to be done together.

> These steps fall into the supervised machine learning realm which I am
> not familiar with and googling around the options seem endless. I was
> wondering whether anyone could suggest reasonable well-established
> algorithms to use for both steps.

Check out the MLInterfaces package. There are MANY methods that could be applied. It really isn't possible to boil this down to an email answer, unfortunately.

Sean
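To make Sean's pointer concrete, here is a minimal sketch (not a recipe from the thread) of how MLInterfaces can couple steps 1 and 2, re-selecting features inside every cross-validation fold. The ExpressionSet `eset` and its phenoData column `group` are hypothetical placeholders, and the classifier and gene count are arbitrary choices; see ?xvalSpec for the fsFun interface.

```r
## Hypothetical setup: 'eset' is an ExpressionSet whose phenoData has a
## four-level factor 'group' (the cluster labels from the Bayesian clustering).
library(MLInterfaces)

res <- MLearn(group ~ ., data = eset,
              knnI(k = 3),                       # k-nearest-neighbour classifier
              xvalSpec("LOG", 5,                 # 5-fold cross-validation
                       balKfold.xvspec(5),       # folds balanced across the 4 groups
                       fsFun = fs.absT(50)))     # re-pick top 50 genes (by |t|) per fold

confuMat(res)    # cross-validated confusion matrix
fsHistory(res)   # genes chosen in each fold -- overlap across folds gauges stability
```

Because the gene list is recomputed within each fold, the confusion matrix is an honest estimate, and `fsHistory()` directly addresses the stability question in step 2.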
Hi,

On Tue, Jan 19, 2010 at 12:09 PM, Sean Davis <seandavi at gmail.com> wrote:

> Check out the MLInterfaces package. There are MANY methods that could
> be applied. It really isn't possible to boil this down to an email
> answer, unfortunately.

While this is absolutely true, one could always offer a simple suggestion :-)

A very easy (for you (Daniel)) thing to do would be to try the glmnet package and perform logistic regression to build several (four) one-against-all classifiers. The nice thing about glmnet is that it uses "the lasso" (or elastic net) regularizer to help cope with your (likely) "p >> n" problem, and returns a model with few coefficients that best predicts in the scenario you've given it. So, by giving it an appropriate scenario, you essentially get the ever-coveted-and-quite-controversial "gene signature" for your group/phenotype of interest.

You'll of course have to do cross-validation etc., which, as Sean and Kasper have pointed out, is essential and (by definition) requires that you split your data into (several) training/test sets.

I agree with Kasper's final sentiment as well ... but while you most likely won't get a patent for some diagnostic indicator (of whatever), it doesn't mean that the genes in your "signature" won't be informative for further downstream analysis (e.g. to help direct further bench experiments (after more analysis, of course)).

Lastly, if you extract your expression data into a matrix and are comfortable working with it that way, you can also look at the CRAN caret package for functionality similar to MLInterfaces to help set up your data for cross-validation, etc. In fact, there is a nice paper written by the author of the caret package that shows you how to use it, which might not hurt to read anyway if this type of stuff is new to you: http://www.jstatsoft.org/v28/i05

Hope that helps,

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
Memorial Sloan-Kettering Cancer Center
Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
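A sketch of Steve's glmnet suggestion, assuming (hypothetically) an expression matrix `x` with samples in rows and genes in columns, and a four-level factor `y` of cluster labels. `cv.glmnet` supplies the cross-validation over the lasso path; note that glmnet also offers `family = "multinomial"` to fit all four groups in a single model.

```r
## Hypothetical inputs: x = samples-by-genes matrix (with gene colnames),
## y = factor holding the four cluster labels, one per sample.
library(glmnet)

set.seed(42)
signatures <- lapply(levels(y), function(grp) {
  y01   <- as.numeric(y == grp)                  # one-against-all response
  cvfit <- cv.glmnet(x, y01, family = "binomial",
                     alpha = 1, nfolds = 5)      # alpha = 1 -> lasso penalty
  beta  <- as.matrix(coef(cvfit, s = "lambda.min"))
  setdiff(rownames(beta)[beta[, 1] != 0], "(Intercept)")  # the "signature"
})
names(signatures) <- levels(y)
signatures   # genes with non-zero lasso coefficients, per group
```

Rerunning this with different seeds (hence different fold splits) and comparing the resulting gene lists is one simple way to probe the stability asked about in step 2.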
@kasper-daniel-hansen-2979
Last seen 10 months ago
United States
On Jan 19, 2010, at 11:11 AM, Daniel Brewer wrote:

> 1) Pick a group of genes that best predict which group a sample belongs to.
> 2) Determine how stable these prediction sets are through some sort of
> cross-validation (I would prefer not to divide my set into a training
> and test set for stage one)

If you don't do this (the statement in parentheses) you will most likely get crap. Note the astounding number of papers in the literature that have attempted to do this, and note that these papers never get replicated, most likely because the statistical analysis is overly optimistic. The track record for being able to do this is extremely bad, despite the number of papers claiming that their signature method is something like 99% accurate.

Kasper
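Kasper's point is about selection bias: if the gene set is picked once on the full data, the later cross-validation uses test samples that already influenced the gene list, and accuracy is inflated. A minimal illustration on simulated pure-noise data (hypothetical sizes; base R plus the class package), with genes re-selected inside each fold:

```r
library(class)   # for knn()

set.seed(1)
n <- 40; p <- 500                            # toy sizes: 40 samples, 500 genes
x <- matrix(rnorm(n * p), n, p)              # pure noise: no real signal
group <- factor(rep(1:4, each = n / 4))      # four cluster labels

folds <- sample(rep(1:5, length.out = n))    # 5-fold split
acc <- numeric(5)
for (f in 1:5) {
  tr <- folds != f
  ## feature selection on the TRAINING samples only: top 20 genes by F-statistic
  fstat <- apply(x[tr, ], 2, function(g)
    summary(aov(g ~ group[tr]))[[1]][["F value"]][1])
  top  <- order(fstat, decreasing = TRUE)[1:20]
  pred <- knn(x[tr, top], x[!tr, top], group[tr], k = 3)
  acc[f] <- mean(pred == group[!tr])
}
mean(acc)   # hovers near chance (0.25), as it should for noise; selecting the
            # genes once on ALL samples first would report spuriously high accuracy
```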
@lavinia-gordon-2959
Last seen 9.6 years ago
Hi Dan,

> These steps fall into the supervised machine learning realm which I am
> not familiar with and googling around the options seem endless. I was
> wondering whether anyone could suggest reasonable well-established
> algorithms to use for both steps.

Have a look at the CRAN Machine Learning task view:
http://cran.ms.unimelb.edu.au/web/views/MachineLearning.html

I would suggest going through the literature and looking at some papers that have dealt with your type of data, as some of these packages are really aimed at specific types of data, e.g. tumor classification or survival data. See, for example:
http://www.pnas.org/content/98/19/10869.abstract

Lavinia Gordon
Research Officer, Bioinformatics
Murdoch Childrens Research Institute
Royal Children's Hospital
Flemington Road
Parkville Victoria 3052
Australia
telephone: +61 3 8341 6221
www.mcri.edu.au
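As a side note, if you want to explore that task view locally, the CRAN ctv package can install everything it lists in one call (heavyweight, but convenient for browsing):

```r
## Install all packages listed in the CRAN "MachineLearning" task view.
install.packages("ctv")
library(ctv)
install.views("MachineLearning")
```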
@joern-toedling-3465
Last seen 9.6 years ago
Hi Dan,

one more suggestion: a few former colleagues of mine used to teach the statistical reasoning for addressing these problems, and how to solve them in R, in a very accessible way. Have a look at the material from that course here:

http://compdiag.molgen.mpg.de/ngfn/pma2005nov.shtml

Especially Day 3: Molecular Diagnosis may be of relevance for you.

Regards,
Joern

---
Joern Toedling
Institut Curie -- U900
26 rue d'Ulm, 75005 Paris, FRANCE
Tel. +33 (0)156246927