Which one is more trustable for Cox Proportional Hazard Model with Lasso penalty? LOOCV or K-fold cross-validation?
1
0
Entering edit mode
Talip Zengin ▴ 10
@talip-zengin-14290
Last seen 11 months ago
Mugla, Turkiye

Hi, When I checked articles to find a method to identify predictors of overall survival, mostly glmnet cox model is used with Lasso penalty and cross validation. As cross validation, people used 10-fold cross validation with 1000 iterations or LOOCV.

I read about glmnet method and searched forums for codes. I found two different approaches for 10-fold cross validation: first one is running final glmnet function by using lambda.min value from cv.glmnet results, to create final model; second is (from an article) to find frequency of active covariates from results of cv.glmnet command with 1000 iterations and decide most frequent model as final model. Moreover, I understood that if we adjust nfolds option of cv.glmnet command as number of rows (samples) of response vector (y), it performs LOOCV. For LOOCV, we do not need iterations. If there is any misunderstood part, please correct me.

I tried the codes below and I had close but different results. Therefore I wonder which approach is more trustable, in this case? Note: Approach 3 (LOOCV) takes shorter time than others although disadvantage of LOOCV is time consuming. Is it because of iterations?

library(glmnet)
library(dplyr)

data(CoxExample)
x <- x[1:500,]
y <- y[1:500,]

## Approach 1: 10-fold cross validation with 1000 iterations, then final glmnet run

lambdas = NULL
for (i in 1:1000) {
fit <- cv.glmnet(x,y,family = "cox", nfolds = 10, alpha = 1, grouped = TRUE)
errors = data.frame(fit$lambda,fit$cvm)
lambdas <- rbind(lambdas,errors)
}

# take mean cvm for each lambda
lambdas <- aggregate(lambdas[, 2], list(lambdas$fit.lambda), mean) # select the best one bestindex = which(lambdas[2]==min(lambdas[2])) bestlambda = lambdas[bestindex,1] # and now run glmnet once more with it fit <- glmnet(x,y,family = "cox", alpha = 1, lambda=bestlambda) coef.min = coef(fit, s = "lambda.min") active.min = which(coef.min != 0) active.min ## Approach 2: 10-fold cross validation with 1000 iterations, then finding frequency of active covariates coefs <- NULL active.mins <- list() for (i in 1:1000){ cvfit = cv.glmnet(x, y, family = "cox", nfolds = 10, alpha = 1, grouped = TRUE) coef.min = coef(cvfit, s = "lambda.min") coefs <- cbind(coefs, coef.min) active.min = which(coef.min != 0) active.mins <- c(active.mins, list(active.min)) } table(unlist(lapply(active.mins, paste, collapse = " "))) ## Approach 3: LOOCV through nrow-folds cvfit <- cv.glmnet(x, y, family = "cox", nfolds = nrow(y), alpha = 1, grouped = TRUE) cv.coef.min <- coef(cvfit, s = "lambda.min") cv.active.min <- which(cv.coef.min != 0) cv.active.min fit <- glmnet(x, y, family = "cox", alpha = 1, lambda = cvfit$lambda.min)
coef.min <- coef(fit, s = "lambda.min")
active.min <- which(coef.min != 0)
active.min


Results Approach 1:

[1] 1 2 3 4 5 6 7 8 9 10 13 17 18 24 27 29

Approach 2:

[1] 1 2 3 4 5 6 7 8 9 10 13 17 18 24 27 29 (frequency: 966/1000)

Approach 3:

[1] 1 2 3 4 5 6 7 8 9 10 13 14 17 18 24 27 29

glmnet cox LOOCV k-fold cross validation • 1.2k views
1
Entering edit mode
chris86 ▴ 400
@chris86-8408
Last seen 3.0 years ago
UCL, United Kingdom

Actually (unless this is just for training, then excuse me) none of this is trustworthy because you are not splitting into separate training and testing datasets, e.g.

https://stackoverflow.com/questions/18130338/plotting-an-roc-curve-in-glmnet

If you want to do cross validation + glmnet just use caret with all the data to do it auto, e.g.

ctrl <- trainControl(method = 'cv', summaryFunction=twoClassSummary, classProbs=T, savePredictions = T, verboseIter = T)

fit2 <- train(as.formula( paste( 'labelsa', '~', '.' ) ), data=test3, method="glmnet", trControl=ctrl, metric = "ROC")

Then fit2 contains your mean AUC.

If this is just for training (your finding the best model on this data for prediction), approach 1 sounds reasonable to me. You can also use caret for finding the best model and pass it a grid of lambda and alpha values.