How to do k-fold validation using SVM

Song, Guangchun ▴ 30

@song-guangchun-109

Last seen 11.5 years ago

Does anyone know how to do k-fold cross-validation on a training data set with an SVM? Thanks. Guangchun

• 3.3k views

ADD COMMENT • link updated 23.1 years ago by Stephen Henderson ★ 1.0k • written 23.1 years ago by Song, Guangchun ▴ 30

rgentleman ★ 5.5k

@rgentleman-7725

Last seen 10.8 years ago

United States

On Fri, Jan 24, 2003 at 01:26:07PM -0000, Stephen Henderson wrote:
> No not before you start but after each fold, so that each training round
> uses a slightly different set of genes/features.

Typically you need to do some filtering (what I've been calling non-specific filtering) before any model fitting. Genes that show little variation across samples are not interesting and can be excluded.

Then, inside the cross-validation, I usually do something like:

    cvX <- function(data, filter, otherargs)

where filter is a function that takes an exprSet and returns the appropriate subset. On each iteration, apply filter to the training set, then build the model and test it. If you make the filter function a parameter of the cv function, you can change your gene selection method (say, from t-test to ROC) without doing much more than writing a new gene selection function.

Robert Gentleman
Department of Biostatistics, Harvard School of Public Health
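
A minimal base-R sketch of the "filter as a parameter" idea described above. The names cv_with_filter, top_variance_filter and nearest_centroid are hypothetical, the data are simulated, a plain matrix stands in for the exprSet, and a nearest-centroid rule stands in for the SVM so that the example runs without extra packages:

```r
# Sketch only: all function names below are hypothetical stand-ins,
# not functions from Bioconductor or any other package.

# Filter: return the column indices of the n_keep most variable genes.
top_variance_filter <- function(x, y, n_keep = 5) {
  order(apply(x, 2, var), decreasing = TRUE)[seq_len(n_keep)]
}

# Stand-in classifier: assign each test sample to the class whose
# training centroid is closest in squared Euclidean distance.
nearest_centroid <- function(x_train, y_train, x_test) {
  classes <- levels(y_train)
  dists <- sapply(classes, function(cl) {
    centroid <- colMeans(x_train[y_train == cl, , drop = FALSE])
    rowSums(sweep(x_test, 2, centroid)^2)
  })
  dists <- matrix(dists, nrow = nrow(x_test))   # also covers a single test row
  factor(classes[max.col(-dists, ties.method = "first")], levels = classes)
}

# k-fold CV in which the filter is re-applied to each training set.
cv_with_filter <- function(x, y, filter, k = 5, ...) {
  y <- factor(y)
  folds <- sample(rep(seq_len(k), length.out = length(y)))
  wrong <- 0
  for (fold in seq_len(k)) {
    test <- folds == fold
    # Feature selection sees the training samples only.
    keep <- filter(x[!test, , drop = FALSE], y[!test], ...)
    pred <- nearest_centroid(x[!test, keep, drop = FALSE], y[!test],
                             x[test, keep, drop = FALSE])
    wrong <- wrong + sum(pred != y[test])
  }
  wrong / length(y)   # cross-validated misclassification rate
}

set.seed(1)
x <- matrix(rnorm(40 * 50), nrow = 40)      # 40 samples, 50 simulated "genes"
y <- rep(c("a", "b"), each = 20)
x[y == "b", 1:3] <- x[y == "b", 1:3] + 3    # three informative genes
err <- cv_with_filter(x, y, filter = top_variance_filter, k = 5, n_keep = 5)
print(err)
```

Swapping the selection method then only means passing a different function as filter; the cross-validation loop itself does not change.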

ADD COMMENT • link 23.1 years ago rgentleman ★ 5.5k

Adaikalavan Ramasamy ▴ 140

@adaikalavan-ramasamy-167

Last seen 11.5 years ago

You might want to use the function svm() in the e1071 library with the option 'cross'.

Or you can manually break the dataset into k subsets and write a loop. This might be better if you prefer to do stratified sampling for the fold rather than random sampling.
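
Both options can be sketched roughly as follows. The helper stratified_folds is a hypothetical name, iris is only a stand-in data set, and the SVM part runs only if the e1071 package is installed:

```r
# Sketch only: stratified_folds is a hypothetical helper, not part of e1071.

# Assign fold labels 1..k separately within each class, so every fold
# receives roughly the same class proportions as the full data set.
stratified_folds <- function(y, k) {
  folds <- integer(length(y))
  for (idx in split(seq_along(y), y)) {
    folds[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  folds
}

set.seed(42)
y <- iris$Species
folds <- stratified_folds(y, k = 5)
fold_table <- table(y, folds)   # 10 samples of each species in every fold

if (requireNamespace("e1071", quietly = TRUE)) {
  # Option 1: built-in k-fold cross-validation via the 'cross' argument.
  fit <- e1071::svm(Species ~ ., data = iris, kernel = "linear", cross = 5)
  print(fit$tot.accuracy)       # overall CV accuracy, in percent

  # Option 2: manual loop over the stratified folds.
  correct <- 0
  for (fold in 1:5) {
    test <- folds == fold
    model <- e1071::svm(Species ~ ., data = iris[!test, ], kernel = "linear")
    correct <- correct + sum(predict(model, iris[test, ]) == iris$Species[test])
  }
  print(correct / nrow(iris))   # stratified CV accuracy, as a fraction
}
```

The manual loop costs a few more lines but gives full control over how the folds are drawn.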

ADD COMMENT • link 23.1 years ago Adaikalavan Ramasamy ▴ 140

On Fri, 24 Jan 2003, Adaikalavan Ramasamy wrote:
> You might want to use the function svm() in the e1071 library with the
> option 'cross'.
>
> Or you can manually break the dataset into k subsets and write a loop.
> This might be better if you prefer to do stratified sampling for the
> fold rather than random sampling.

or you can use the "errorest" function in the ipred-package (see R News 2(2) for examples)

Torsten
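
A minimal sketch of what such an errorest call might look like for an SVM, assuming the ipred and e1071 packages are installed. The iris data set and the choice of 10 folds are illustrative assumptions, not taken from the thread:

```r
# Sketch only: runs the estimate only when ipred and e1071 are available.
cv_error <- NA_real_

if (requireNamespace("ipred", quietly = TRUE) &&
    requireNamespace("e1071", quietly = TRUE)) {
  set.seed(1)
  cv <- ipred::errorest(Species ~ ., data = iris,
                        model = e1071::svm,        # the classifier to evaluate
                        estimator = "cv",
                        est.para = ipred::control.errorest(k = 10),
                        kernel = "linear")         # passed on to svm() via ...
  cv_error <- cv$error                             # misclassification rate
}
print(cv_error)
```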

ADD REPLY • link 23.1 years ago Torsten Hothorn ▴ 30

Stephen Henderson ★ 1.0k

@stephen-henderson-71

Last seen 8.8 years ago

Is there a simple way to do gene/feature selection for each round of cross-validation using the ipred errorest function?

I do not mean selecting some set of genes and then doing a CV on this subset, but rather reselecting the subset for each fold.

I had written a rather long-winded loop before this posting (I had missed ipred), but now wonder if there is a shortcut.

ADD COMMENT • link 23.1 years ago Stephen Henderson ★ 1.0k

On Fri, 24 Jan 2003, Stephen Henderson wrote:
> Is there a simple??? way to do a gene/feature selection for each round of
> cross validation-- using the ipred errorest function?

By supplying your own `model' argument / function to `errorest':

    model <- function(formula, data, ...) {
        # 1) evaluate the formula, so you end up with
        #    y (classes) and X (matrix of expression values, I guess);
        #    look at any of the foo.formula methods or errorest.data.frame,
        #    for example
        ...
        # 2) perform the gene / feature selection, so you end up with a
        #    subset of features from X, say Xsub
        # 3) and finally call your favorite classifier and return its output
        randomForest(y ~ ., data = Xsub)
    }

Then `cv.factor' calls `model' for each fold. No idea if this is "simple" (no. 1 probably isn't), but relying on the `model(formula, data, ...)' interface makes it flexible. You can pass more arguments to `model' through "...": for example, the number of genes to be selected.

Hope this helps,

Torsten
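
One way the three steps above might be filled in, as a sketch rather than a definitive implementation. The names top_t_genes and select_then_svm are hypothetical, the data are simulated, an SVM from e1071 is used in place of the random forest, and the errorest call runs only if ipred and e1071 are installed:

```r
# Sketch only: top_t_genes and select_then_svm are hypothetical names.

# Rank features by absolute two-sample t statistic; return the top n names.
top_t_genes <- function(X, y, n) {
  lv <- levels(factor(y))
  tstat <- vapply(colnames(X), function(g) {
    abs(unname(t.test(X[y == lv[1], g], X[y == lv[2], g])$statistic))
  }, numeric(1))
  names(sort(tstat, decreasing = TRUE))[seq_len(n)]
}

# 'model' function for errorest: select genes on the data it is given
# (the training part of the current fold), then fit the classifier.
select_then_svm <- function(formula, data, ngenes = 5, ...) {
  mf <- model.frame(formula, data)          # step 1: evaluate the formula
  y <- model.response(mf)
  X <- mf[, -1, drop = FALSE]
  keep <- top_t_genes(X, y, ngenes)         # step 2: feature selection
  # Step 3: refit on the selected genes only. A formula-based fit means
  # predict() later picks just these columns out of the test data.
  e1071::svm(reformulate(keep, response = all.vars(formula)[1]),
             data = data, ...)
}

set.seed(7)
n <- 40
d <- as.data.frame(matrix(rnorm(n * 30), nrow = n))   # columns V1..V30
d$class <- factor(rep(c("ctrl", "case"), each = n / 2))
d$V1 <- d$V1 + 3 * (d$class == "case")                # one informative gene

top3 <- top_t_genes(d[, paste0("V", 1:30)], d$class, 3)

cv_error <- NA_real_
if (requireNamespace("ipred", quietly = TRUE) &&
    requireNamespace("e1071", quietly = TRUE)) {
  cv <- ipred::errorest(class ~ ., data = d, model = select_then_svm,
                        estimator = "cv", ngenes = 5, kernel = "linear")
  cv_error <- cv$error
}
print(cv_error)
```

Here ngenes and kernel reach select_then_svm through the "..." of errorest, which is the mechanism the post describes for passing the number of genes to be selected.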

ADD REPLY • link 23.1 years ago Torsten Hothorn ▴ 30

Sorry for jumping into the thread so late: I just read the posting today. Anyway, I have used the following code a couple of times, and maybe it is of some help to you. In each "round" (for each set of training data) I select the best K genes (where best means with largest abs. value of t statistic) and then fit the svm using those K genes.

For a couple of reasons, I use mt.maxT (from the multtest library) to get the t statistic, but you can modify the function at your convenience (like an ANOVA for > 3 or whatever you want). Note also that I use a linear kernel.

Hope it helps,

Ramón

    library(multtest)  # provides mt.maxT
    library(e1071)     # provides svm

    gene.select <- function(data, class, size = NULL, threshold = NULL) {
        ## A slower alternative to mt.maxT would be:
        ## t.stat <- apply(data, 2, function(x) {abs(t.test(x ~ class)$statistic)})
        ## mt.maxT returns the genes sorted by significance;
        ## column 1 holds their original row indices.
        tmp <- mt.maxT(t(data), class, B = 1)[, c(1, 2)]
        selected <- tmp[seq_len(size), 1]
        return(selected)
    }

    cross.valid.svm <- function(data, y, knumber = 10, size = 200) {
        ## data is structured as subjects in rows, genes in columns
        ## (and thus is transposed inside gene.select to be fed to mt.maxT).
        ## y is assumed to be coded 0/1 (see the error counts below).
        ## If you want leave-one-out, set knumber = NULL
        ## size is the number of genes used when building the classifier.
        ## those are the "best" genes, based on a t-statistic.

        ## First, selecting the data subsets for cross-validation
        if(is.null(knumber)) {
            knumber <- length(y)
            leave.one.out <- TRUE
        } else leave.one.out <- FALSE
        N <- length(y)
        if(knumber > N) stop("knumber has to be <= number of subjects")
        reps <- floor(N/knumber)
        reps.last <- N - (knumber-1)*reps
        index.select <- c(rep(seq(from = 1, to = (knumber - 1)), reps),
                          rep(knumber, reps.last))
        index.select <- sample(index.select)

        cv.errors <- matrix(-99, nrow = knumber, ncol = 4)

        ## Fit model for each data set.
        for(sample.number in 1:knumber) {
            ## gene selection (training samples of this fold only)
            gene.subset <- gene.select(data[index.select != sample.number, ],
                                       y[index.select != sample.number],
                                       size = size)

            ## predict from svm on that subset
            y.f <- factor(y)
            test.set <- data[index.select == sample.number, gene.subset]
            ## for leave-one-out
            if(is.null(dim(test.set))) test.set <- matrix(test.set, nrow = 1)
            predicted <- predict(svm(data[index.select != sample.number, gene.subset],
                                     y.f[index.select != sample.number],
                                     kernel = "linear"), test.set)

            cv.errors[sample.number, 1] <-
                length(which((y.f[index.select == sample.number] == 1) & predicted == 0))
            cv.errors[sample.number, 2] <-
                length(which((y.f[index.select == sample.number] == 0) & predicted == 1))
            cv.errors[sample.number, 3] <-
                length(which(y.f[index.select == sample.number] != predicted))
            cv.errors[sample.number, 4] <- length(predicted)
        }
        cv.errors <- data.frame(cv.errors)
        names(cv.errors) <- c("true.1.pred.0", "true.0.pred.1", "total.error",
                              "number.predicted")
        average.error.rate <- sum(cv.errors[, 3])/sum(cv.errors[, 4])
        return(list(cv.errors = cv.errors, average.error.rate = average.error.rate))
    }

    ## An example code:
    cross.valid.svm(matrix.covar, class.vector, knumber = 10, size = 10)

--
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO) (Spanish National Cancer Center)
Melchor Fernández Almagro, 3, 28029 Madrid (Spain)
http://bioinfo.cnio.es/~rdiaz

Attachment: cv-svm.R

    cross.valid.optim.svm <- function(data, y, knumber = 10, size = 200) {
        ## data is structured as subjects in rows, genes in columns
        ## (and thus is transposed inside gene.select to be fed to mt.maxT).
        ## If you want leave-one-out, set knumber = NULL
        ## size is the number of genes used when building the classifier.
        ## those are the "best" genes, based on a t-statistic.

        ## gene selection (done once, on all samples, before cross-validation)
        gene.subset <- gene.select(data, y, size = size)

        ## First, selecting the data subsets for cross-validation
        if(is.null(knumber)) {
            knumber <- length(y)
            leave.one.out <- TRUE
        } else leave.one.out <- FALSE
        N <- length(y)
        if(knumber > N) stop("knumber has to be <= number of subjects")
        reps <- floor(N/knumber)
        reps.last <- N - (knumber-1)*reps
        index.select <- c(rep(seq(from = 1, to = (knumber - 1)), reps),
                          rep(knumber, reps.last))
        index.select <- sample(index.select)

        cv.errors <- matrix(-99, nrow = knumber, ncol = 4)

        ## Fit model for each data set.
        for(sample.number in 1:knumber) {
            ## predict from svm on that subset
            y.f <- factor(y)
            test.set <- data[index.select == sample.number, gene.subset]
            ## for leave-one-out
            if(is.null(dim(test.set))) test.set <- matrix(test.set, nrow = 1)
            predicted <- predict(svm(data[index.select != sample.number, gene.subset],
                                     y.f[index.select != sample.number],
                                     kernel = "linear"), test.set)

            cv.errors[sample.number, 1] <-
                length(which((y.f[index.select == sample.number] == 1) & predicted == 0))
            cv.errors[sample.number, 2] <-
                length(which((y.f[index.select == sample.number] == 0) & predicted == 1))
            cv.errors[sample.number, 3] <-
                length(which(y.f[index.select == sample.number] != predicted))
            cv.errors[sample.number, 4] <- length(predicted)
        }
        cv.errors <- data.frame(cv.errors)
        names(cv.errors) <- c("true.1.pred.0", "true.0.pred.1", "total.error",
                              "number.predicted")
        average.error.rate <- sum(cv.errors[, 3])/sum(cv.errors[, 4])
        return(list(cv.errors = cv.errors, average.error.rate = average.error.rate))
    }

    ## gene.select is the same function as in the message body.

    cross.valid.optim.svm(matrix.data, censored, knumber = 10, size = 10)
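
One detail of the fold assignment in the code above: when the number of subjects is not a multiple of knumber, the whole remainder goes to the last fold. A small base-R sketch compares that scheme with a more even alternative; thread_folds re-creates the index.select logic from the post, and balanced_folds is a hypothetical helper, not something proposed in the thread:

```r
# Sketch only: both helper names are hypothetical.

# Fold labels as built in the post: folds 1..(k-1) get floor(N/k) samples
# each, and the last fold gets everything that is left.
thread_folds <- function(N, k) {
  reps <- floor(N / k)
  reps.last <- N - (k - 1) * reps
  sample(c(rep(seq(from = 1, to = k - 1), reps), rep(k, reps.last)))
}

# Alternative: recycle the labels 1..k so fold sizes differ by at most one.
balanced_folds <- function(N, k) {
  sample(rep(seq_len(k), length.out = N))
}

set.seed(3)
sizes_thread   <- as.vector(table(thread_folds(19, 10)))
sizes_balanced <- as.vector(table(balanced_folds(19, 10)))
print(sizes_thread)     # 1 1 1 1 1 1 1 1 1 10
print(sizes_balanced)   # 2 2 2 2 2 2 2 2 2 1
```

With 19 subjects and 10 folds, the original scheme tests more than half of the samples in a single fold, whereas the recycled labels keep the folds nearly equal in size.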

ADD REPLY • link 23.1 years ago Ramon Diaz ★ 1.1k

Stephen Henderson ★ 1.0k

@stephen-henderson-71

Last seen 8.8 years ago

No, not before you start but after each fold, so that each training round uses a slightly different set of genes/features.

-----Original Message-----
From: Robert Gentleman
Sent: Friday, January 24, 2003 1:16 PM
To: Stephen Henderson
Subject: Re: [BioC] How to do k-fold validation using SVM

On Fri, Jan 24, 2003 at 01:08:23PM -0000, Stephen Henderson wrote:
> Is there a simple??? way to do a gene/feature selection for each round of
> cross validation-- using the ipred errorest function?

Do you mean take a subset before you start? There is a whole package called genefilter that does all sorts of things in that regard.

robert

ADD COMMENT • link 23.1 years ago Stephen Henderson ★ 1.0k

Stephen Henderson ★ 1.0k

@stephen-henderson-71

Last seen 8.8 years ago

That looks good, thanks. A question and a suggestion follow.

1. You say you used a linear kernel. Did you find this to be best after testing and/or optimising the other kernels?

2. A set of wrapper functions (for multtest, ipred, and e1071) that consistently interface the affy data object with a few good feature selection methods and classification methods might be a useful new addition to BioC.

Stephen Henderson

ADD COMMENT • link 23.1 years ago Stephen Henderson ★ 1.0k

On Monday 27 January 2003 14:55, Stephen Henderson wrote:
> 1. You say you used a linear kernel--Did you find this to be best after
> testing and/or optimising the other kernels?

Not really. At least, not after extensive testing. But with the data I am working with, linear seemed better than radial, and the people I am working with preferred linear to radial (and I do too: I don't really understand linear SVMs, much less other kernels).

> 2. A set of wrapper functions (for multtest, ipred, and e1071) that
> consistently interface the affy data object with a few good feature
> selection methods and classification methods might be a useful new addition
> to BioC.

I guess that is an invitation for me to write that. I actually like the suggestion (and I think it should be fairly easy to get my boss to regard it as a good idea too). So I'll probably go for it (but it will be a few months before I can do it).

[A disclaimer, though: I am not a particularly gifted R programmer; for example, I've managed to avoid learning anything about S4 classes, and have no idea how Sweave works; I suppose this is an opportunity to learn some of this stuff.]

Best,

--
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO) (Spanish National Cancer Center)
Melchor Fernández Almagro, 3, 28029 Madrid (Spain)
http://bioinfo.cnio.es/~rdiaz

ADD REPLY • link 23.1 years ago Ramon Diaz ★ 1.1k
