Model variable optimisation
Tony McBryan ▴ 60
@tony-mcbryan-3847
Last seen 9.6 years ago
Hello list,

I'm using limma to analyse some Agilent microarrays from an experiment with a large number of variables, and I'm running into a wall trying to decide on the best model to use.

Briefly (with R commands abbreviated for clarity): I'm using the AgiMicroRna package, reading in the arrays and the target file, background correcting with normexp plus an offset, then quantile normalising and converting to log values. I then create a design with model.matrix and fit it with lmFit.

The bit I'm having trouble with is deciding on the appropriate model. We have 36 arrays in total, split evenly by sex (m/f), age (young/old) and treatment (a, b, c), as well as three hybridisation batches, three experimental batches and 12 cell strains (targets.txt pasted below for reference).

If I choose a model such as:

    groups <- model.matrix(~treatment + young.old + hyb + batch + sex)

the results broadly make sense from a biological standpoint. However, if I use:

    groups <- model.matrix(~young.old + sex + treatment + hyb + batch + strain +
                           treatment:young.old + young.old:sex + treatment:sex +
                           batch:young.old)

which includes all the variables plus interaction effects, the results make less sense and generally don't overlap with the results of the first model. My suspicion is therefore that the latter model is too complex and is overfitting. Sensible?

Looking at the R-squared statistics for the fit of the larger model:

    rsquared <- function(fit, data) {
        sst <- rowSums(data^2)
        ssr <- sst - fit$df.residual * fit$sigma^2
        ssr / sst
    }

    > summary(rsquared(fit, data))
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.9762  0.9998  0.9999  0.9998  1.0000  1.0000

the model seems to be fitting the data virtually 100%, which seems quite high to me. Looking at one probe individually supports this too:

    > summary(lm(dat2[1, ] ~ -1 + groups))

    Call:
    lm(formula = dat2[1, ] ~ -1 + groups)

    <--Snip-->

    Residual standard error: 0.1418 on 22 degrees of freedom
    Multiple R-squared: 0.9999,  Adjusted R-squared: 0.9999
    F-statistic: 1.913e+04 on 14 and 22 DF,  p-value: < 2.2e-16

However, when I look at the simpler model:

    > summary(rsquared(simplerfit, dat2))
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.9645  0.9997  0.9999  0.9997  1.0000  1.0000

the R-squared is largely the same, which I would interpret as the additional interaction terms adding little to the model. Would I be correct in this interpretation?

Now, I would like to do something akin to a stepwise regression to determine the appropriate model. I can do this for a single probe like so:

    > step(lm(dat2[1, ] ~ young.old + sex + treatment + hyb + batch + strain +
              treatment:young.old + young.old:sex + treatment:sex + batch:young.old))

which suggests the model to use is:

    ~young.old + sex + treatment + batch + strain + young.old:sex

However, that has obviously been computed for just a single probe, and I'll get different results if I use a different probe as the basis for the model. I can't find any obvious way to do a similar analysis for the fit produced by lmFit, i.e. to optimise the model for the entire set of probes and find the best overall model.

TL;DR: Broadly, I'm looking for a way to minimise AIC (or a similar goodness-of-fit metric) by model selection for a fit produced by lmFit. Can anyone offer suggestions on approaches that can do this?
Tony McBryan

---
targets.txt

    FileName     hyb  strain  treatment  batch  sex  young-old
    40398_2.txt  1    S030    Ctrl       1      m    old
    40398_3.txt  1    S030    Rot 3d     1      m    old
    40398_4.txt  1    S030    Rot 3h     1      m    old
    40399_2.txt  1    JS14    Rot 3d     1      m    young
    40399_3.txt  1    JS14    Ctrl       1      m    young
    40399_4.txt  1    JS14    Rot 3h     1      m    young
    40400_2.txt  1    S175    Ctrl       1      f    old
    40400_3.txt  1    S175    Rot 3d     1      f    old
    40400_4.txt  1    S175    Rot 3h     1      f    old
    40401_1.txt  1    JS15    Rot 3d     1      f    young
    40401_2.txt  1    JS15    Rot 3h     1      f    young
    40401_3.txt  1    JS15    Ctrl       1      f    young
    40396_2.txt  2    S386    Rot 3h     1      m    old
    40396_3.txt  2    S386    Rot 3d     1      m    old
    40396_4.txt  2    S386    Ctrl       1      m    old
    40397_1.txt  2    JS20    Ctrl       1      m    young
    40397_2.txt  2    JS20    Rot 3h     1      m    young
    40397_4.txt  2    JS20    Rot 3d     1      m    young
    40415_2.txt  2    S569    Ctrl       3      f    old
    40415_3.txt  2    S569    Rot 3d     3      f    old
    40415_4.txt  2    S569    Rot 3h     3      f    old
    40437_2.txt  2    JS10    Rot 3d     2      f    young
    40437_3.txt  2    JS10    Rot 3h     2      f    young
    40437_4.txt  2    JS10    Ctrl       2      f    young
    40402_2.txt  3    S591    Rot 3d     2      m    old
    40402_3.txt  3    S591    Ctrl       2      m    old
    40402_4.txt  3    S591    Rot 3h     2      m    old
    40403_2.txt  3    JS30    Rot 3d     2      m    young
    40403_3.txt  3    JS30    Rot 3h     2      m    young
    40403_4.txt  3    JS30    Ctrl       2      m    young
    40404_2.txt  3    S534    Rot 3h     3      f    old
    40404_3.txt  3    S534    Ctrl       3      f    old
    40404_4.txt  3    S534    Rot 3d     3      f    old
    40414_2.txt  3    JS05    Rot 3h     3      f    young
    40414_3.txt  3    JS05    Rot 3d     3      f    young
    40414_4.txt  3    JS05    Ctrl       3      f    young
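For illustration only (this sketch is not part of the original thread): per-probe AIC values can be computed directly from an lmFit object, since the residual sum of squares per probe is fit$sigma^2 * fit$df.residual (exactly as in the rsquared function above) and the number of estimated coefficients is the number of arrays minus fit$df.residual. Here dat2 and the factor names are assumed to match the post above, and constants common to both models are dropped from the AIC.

    ## Sketch: compare the simple and full designs genome-wide by per-probe AIC
    ## derived from lmFit output.
    library(limma)

    aicFromFit <- function(fit, data) {
        n   <- ncol(data)                       # number of arrays
        rss <- fit$sigma^2 * fit$df.residual    # residual sum of squares per probe
        k   <- n - fit$df.residual              # number of estimated coefficients
        n * log(rss / n) + 2 * (k + 1)          # +1 for the residual variance
    }

    design1 <- model.matrix(~treatment + young.old + hyb + batch + sex)
    design2 <- model.matrix(~young.old + sex + treatment + hyb + batch + strain +
                            treatment:young.old + young.old:sex + treatment:sex +
                            batch:young.old)

    fit1 <- lmFit(dat2, design1)
    fit2 <- lmFit(dat2, design2)

    ## How many probes favour each design, and which wins in total?
    table(ifelse(aicFromFit(fit1, dat2) < aicFromFit(fit2, dat2), "simple", "full"))
    c(simple = sum(aicFromFit(fit1, dat2)), full = sum(aicFromFit(fit2, dat2)))

limma also ships a selectModel() function, which computes AIC or BIC per probe over a named list of candidate design matrices (e.g. selectModel(dat2, list(simple = design1, full = design2), criterion = "aic")); it may be worth checking whether it is available in your version.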
Regression probe limma AgiMicroRna • 1.3k views
Steve Shen ▴ 330
@steve-shen-3743
Last seen 9.6 years ago
Hi Tony,

Have you tried a linear mixed effects model? It may be able to solve your problem.

Thanks,

Steve
Hi Steve (+list),

I have not tried that yet as I am unfamiliar with them. I've just now read through some of the documentation for the lme() function in the nlme library, but I'm currently struggling to find understandable examples using microarrays.

Would you be able to point me to an example or a vignette with more information on how to do this for microarray data, or even give an example of how it might be used, to get me on my way?

If you have time, would you be able to suggest what the full model from my original post might look like when represented as a linear mixed effects model?

Thanks,

Tony McBryan
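For illustration only (not part of the thread): one way the design above might be written as a mixed model is to keep age, sex, treatment and the batch factors as fixed effects and treat strain as a random intercept. The sketch below fits a single probe and assumes dat2 and targets are the expression matrix and targets data frame from the original post.

    ## Sketch: the design expressed with nlme::lme, one probe at a time,
    ## with strain as a random intercept.
    library(nlme)

    d <- data.frame(expression = as.numeric(dat2[1, ]),
                    young.old  = targets$young.old,
                    sex        = targets$sex,
                    treatment  = targets$treatment,
                    hyb        = factor(targets$hyb),
                    batch      = factor(targets$batch),
                    strain     = targets$strain)

    fit.lme <- lme(expression ~ young.old * treatment + sex + hyb + batch,
                   random = ~ 1 | strain, data = d, method = "ML")
    summary(fit.lme)   # method = "ML" so AICs are comparable across fixed-effect structures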
Tony,

Here is one which I did a while ago, but your case seems much more complicated. It may also be a good idea to simplify your variables first by doing surrogate variable analysis. Good luck,

-steve

    d <- data.frame(expression = as.vector(data.matrix[i, ], mode = "numeric"),
                    response   = targets$response,
                    time       = targets$time,
                    patient    = targets$patient)
    lme(expression ~ response * time, random = ~1 | patient, data = d)
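The surrogate variable analysis mentioned above could, for example, be run with the Bioconductor sva package. A minimal sketch (not from the thread; dat2 and targets are again assumed to be the expression matrix and targets data frame from the original post):

    ## Sketch: estimate surrogate variables for unmodelled structure and
    ## append them to the design before lmFit.
    library(sva)
    library(limma)

    mod  <- model.matrix(~treatment + young.old + sex, data = targets)  # variables of interest
    mod0 <- model.matrix(~1, data = targets)                            # null model
    svobj <- sva(dat2, mod, mod0)        # dat2: probes x arrays matrix of log values

    design <- cbind(mod, svobj$sv)       # surrogate variables become extra covariates
    fit <- eBayes(lmFit(dat2, design))
    topTable(fit, coef = 2)              # e.g. the first treatment coefficient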
Hi Steve (+list),

I have a few more (hopefully quick) questions as a result of looking at linear mixed effects models (specifically the lme function).

1. Is there any way to fit multiple random effects with lme?

2. Is there any support in Bioconductor for fitting an lme to a microarray, or should I write a bit of code to do that myself? There's obviously no point in spending time on it if something already exists in Bioconductor, but I've not been able to find it. An lmFit-type function, but for lme's?

3. I have noticed that lme() reports an AIC for the model it has fitted, but it doesn't try to optimise the model selection in the way that step() does (at least not for fixed factors), so I'm still not sure to what degree I might be overfitting on the fixed factors (and their interactions). As an example, the two models:

    AIC -46.76461 : s <- summary(lme(expression ~ young.old + treatment,
                                     random = ~1 | strain, data = d))
    AIC -39.30473 : s <- summary(lme(expression ~ young.old + treatment + young.old:treatment,
                                     random = ~1 | strain, data = d))

I would have assumed that if lme() were optimising model selection, the AIC scores for both would be the same, since young.old:treatment would have been dropped to make the second model equivalent to the first. However, there does seem to be an lme equivalent of step() in the MASS package (stepAIC) which does optimise the model by AIC.

However, I think I'm still largely in the same situation as before: I can fit an individual probe using lm() (or now lme()) and minimise its AIC for that one probe using step() (or stepAIC() from MASS for lme's), but I'm still stuck on how to decide whether there is a single model I can fit to the entire array. I suppose the alternative suggestion is that I should be fitting individually optimised models to each probe, but I'm not convinced that makes much sense, particularly where one contrast may then appear or disappear for different probes.

Any advice you can offer would be much appreciated.

Tony McBryan
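On the "single model for the whole array" question, one rough approach (a sketch only, under the same assumptions about dat2 and targets as above, not an existing Bioconductor function) is to fit each candidate fixed-effect structure per probe with lme and compare candidates by their summed AIC, rather than optimising each probe separately:

    ## Sketch: compare candidate fixed-effect structures genome-wide by fitting
    ## each per probe (strain as random intercept) and summing the per-probe AICs.
    ## method = "ML" because REML AICs are not comparable across different fixed
    ## effects. This is slow for a full array; try a subset of probes first.
    library(nlme)

    candidates <- list(
        main    = expression ~ young.old + treatment + sex + batch,
        age.trt = expression ~ young.old * treatment + sex + batch,
        age.sex = expression ~ young.old * sex + treatment + batch
    )

    totalAIC <- sapply(candidates, function(form) {
        sum(apply(dat2, 1, function(y) {
            d <- data.frame(expression = as.numeric(y), targets)
            AIC(lme(form, random = ~ 1 | strain, data = d, method = "ML"))
        }))
    })
    totalAIC   # the structure with the smallest total AIC is preferred overall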