undefined columns selected error when using bagging{ipred}
Guest User ★ 13k
@guest-user-4897
Last seen 10.4 years ago
Dear All,

I'm trying to reproduce the results of the survival analysis in Chapter 17, p. 307 of "Bioinformatics and Computational Biology Solutions using R and Bioconductor" using the code chunks from http://www.bioconductor.org/help/publications/books/bioinformatics-and-computational-biology-solutions/chapter-code/Computational_Inference.R

The call to the bagging function throws an error, although I decreased the number of variables selected to p=25 (so the model fit wouldn't be over-determined). The code is below.

Thanks a lot,
Constanze

> library("exactRankTests")
Package 'exactRankTests' is no longer under development.
Please consider using package 'coin' instead.

> # library("coin")
> library("ipred")
Loading required package: rpart
Loading required package: MASS
Loading required package: mlbench
Loading required package: nnet
Loading required package: class
> library("kidpack")
*** Deprecation warning ***:
The package 'kidpack' is deprecated and will not be supported after Bioconductor release 2.1.

> data(eset)
> var_selection <- function(indx, expressions, response, p = 100) {
+
+     y <- switch(class(response),
+         "factor" = { model.matrix(~ response - 1)[indx, ,drop = FALSE] },
+         "Surv" = { matrix(cscores(response[indx]), ncol = 1) },
+         "numeric" = { matrix(rank(response[indx]), ncol = 1) }
+     )
+
+     x <- expressions[,indx, drop = FALSE]
+     n <- nrow(y)
+     linstat <- x %*% y
+     Ey <- matrix(colMeans(y), nrow = 1)
+     Vy <- matrix(rowMeans((t(y) - as.vector(Ey))^2), nrow = 1)
+
+     rSx <- matrix(rowSums(x), ncol = 1)
+     rSx2 <- matrix(rowSums(x^2), ncol = 1)
+     E <- rSx %*% Ey
+     V <- n / (n - 1) * kronecker(Vy, rSx2)
+     V <- V - 1 / (n - 1) * kronecker(Vy, rSx^2)
+
+     stats <- abs(linstat - E) / sqrt(V)
+     stats <- do.call("pmax", as.data.frame(stats))
+     return(which(stats > sort(stats)[length(stats) - p]))
+ }
>
> remove <- is.na(eset$survival.time)
> seset <- eset[,!remove]
> response <- Surv(seset$survival.time, seset$died)
> response[response[,1] == 0] <- 1
> expressions <- t(apply(exprs(seset), 1, rank))
> exprDF <- as.data.frame(t(expressions))
>
> I <- nrow(exprDF)
> Iindx <- 1:I
> selected <- var_selection(Iindx, expressions, response,p=25)
> bagg <- bagging(response ~., data = exprDF[,selected],ntrees = 100)
Error in `[.data.frame`(m, attr(Terms, "term.labels")) :
  undefined columns selected

-- output of sessionInfo():

R version 2.15.1 (2012-06-22)
Platform: i486-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=de_DE.utf8       LC_NUMERIC=C
 [3] LC_TIME=de_DE.utf8        LC_COLLATE=de_DE.utf8
 [5] LC_MONETARY=de_DE.utf8    LC_MESSAGES=de_DE.utf8
 [7] LC_PAPER=C                LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] splines stats graphics grDevices utils datasets methods
[8] base

other attached packages:
 [1] kidpack_1.5.10  ipred_0.8-8           class_7.3-4
 [4] nnet_7.3-4      mlbench_2.1-1         MASS_7.3-21
 [7] rpart_3.1-54    exactRankTests_0.8-22 affy_1.26.0
[10] Biobase_2.8.0   survival_2.36-14

loaded via a namespace (and not attached):
[1] affyio_1.16.0 preprocessCore_1.10.0 tools_2.15.1

--
Sent via the guest posting facility at bioconductor.org.
@valerie-obenchain-4275
Last seen 3.0 years ago
United States
Hi Constanze,

The problem appears to be with how bagging() deals with the column names of the sample data frame. The immediate solution is to change the column names to non-numbers,

> bagg <- bagging(response ~., data = exprDF[,selected], ntrees = 100)
Error in `[.data.frame`(m, attr(Terms, "term.labels")) :
  undefined columns selected
> dat <- exprDF[,selected]
> colnames(dat) <- paste0("A", 1:ncol(dat))
> bagg <- bagging(response ~., data = dat, ntrees = 100)
> bagg

Bagging survival trees with 25 bootstrap replications

Call: bagging.data.frame(formula = response ~ ., data = df, ntrees = 100)

As you've seen from the error messages while working through these examples, several packages are no longer maintained and many functions have evolved since the book was written. ipred, the package that bagging() comes from, is currently maintained. I'm cc'ing the maintainer because this issue may be a bug.

Hi Torsten,

It looks like bagging() does not like colnames that are numbers coerced to character. Using a modified example from ?bagging,

data(DLBCL)
## first example works fine
mod <- bagging(Surv(time,cens) ~ ., data=DLBCL, coob=TRUE)
## change the column names of the data.frame
names(DLBCL) <- c("DLCL.Sample", "Gene.Expression", "time", "cens", "IPI", 1:10)
> names(DLBCL)
 [1] "DLCL.Sample"     "Gene.Expression" "time"            "cens"
 [5] "IPI"             "1"               "2"               "3"
 [9] "4"               "5"               "6"               "7"
[13] "8"               "9"               "10"
> mod <- bagging(Surv(time,cens) ~ ., data=DLBCL, coob=TRUE)
Error in `[.data.frame`(m, attr(Terms, "term.labels")) :
  undefined columns selected

The error is thrown from this line in the irpart() function,

isord <- unlist(lapply(m[attr(Terms, "term.labels")], tfun))

When the 'Terms' variable is created, the term labels for the numeric names are wrapped in backticks "`", which prevents them from being matched to the column names of the data.frame (m),

debugging in: irpart(y ~ ., data = mydata, control = control, bcontrol = list(nbagg = nbagg,
    ns = ns, replace = REPLACE))
...
Browse[2]>
debug: Terms <- attr(m, "terms")
...
Browse[2]> attr(Terms, "term.labels")
 [1] "DLCL.Sample"     "Gene.Expression" "IPI"             "`1`"
 [5] "`2`"             "`3`"             "`4`"             "`5`"
 [9] "`6`"             "`7`"             "`8`"             "`9`"
[13] "`10`"
...
Browse[2]> colnames(m)
 [1] "y"               "DLCL.Sample"     "Gene.Expression" "IPI"
 [5] "1"               "2"               "3"               "4"
 [9] "5"               "6"               "7"               "8"
[13] "9"               "10"

Valerie

On 09/05/12 08:21, Constanze [guest] wrote:
> Dear All,
> [...]
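
The mismatch between back-quoted term labels and plain column names can probably be reproduced without ipred. Below is a minimal sketch in base R, assuming a small made-up data frame; the object names (toy, term_labels, safe, lookup) are illustrative and do not come from the thread or from ipred. It also shows one possible alternative to renaming columns by position: make.names() turns numeric names into syntactic ones, and a lookup table keeps the link back to the original identifiers.

```r
## Toy data: one syntactic predictor name ("a") and two purely numeric ones.
## Everything here is hypothetical illustration data, not the kidpack set.
toy <- data.frame(y = c(1.2, 0.7, 3.1, 2.4),
                  a = c(1, 2, 3, 4),
                  check.names = FALSE)
toy[["1"]] <- c(4, 3, 2, 1)
toy[["2"]] <- c(0, 1, 0, 1)

## Expanding the dot in the formula back-quotes the non-syntactic names ...
tt <- terms(y ~ ., data = toy)
term_labels <- attr(tt, "term.labels")
print(term_labels)

## ... while the model frame keeps the plain column names.
m <- model.frame(tt, data = toy)
print(colnames(m))

## Indexing by the back-quoted labels therefore fails, as in irpart().
res <- try(toy[term_labels], silent = TRUE)
print(inherits(res, "try-error"))

## Possible workaround: syntactic names plus a lookup to the original IDs.
safe <- toy
pred <- setdiff(names(safe), "y")
lookup <- data.frame(original = pred,
                     safe = make.names(pred, unique = TRUE),
                     stringsAsFactors = FALSE)
names(safe)[match(pred, names(safe))] <- lookup$safe

## After renaming, the term labels match the column names again.
labels_safe <- attr(terms(y ~ ., data = safe), "term.labels")
print(all(labels_safe %in% names(safe)))
```

Applied to the data in this thread, the same idea would amount to something like names(dat) <- make.names(names(dat)) before calling bagging(), assuming the columns of exprDF carry numeric identifiers as the error suggests. Unlike paste0("A", 1:ncol(dat)), this keeps each original identifier recoverable from the new name.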