Question

Select different linear models in voom


Guest User ★ 13k

@guest-user-4897


I have recently implemented the approach used in voom to estimate the mean and the variance of each log-cpm at the observational level. My dataset contains ~1000 samples and comes with a fair amount of metadata (~400 variables) that may be used as covariates. In principle, this allows for a better construction of the linear model on which both the fitted mean and the fitted variance are estimated in voom, simply by including more factors.

So far, I have used AIC weights to assess how likely each candidate linear model is to explain the data compared with the alternative models. Of course, testing all possible combinations of linear models is computationally infeasible (in principle, 2^400). However, although I found that most genes are well explained by a simple linear model (LM), a non-negligible fraction of them depend on additional factors.

The point is that what makes the expression profile of a gene interesting is precisely when the covariates play an important role in determining its mean and variance. Therefore I am reluctant to use the simple LM, because this would eliminate all the covariates. On the other hand, I am reluctant to use the more complicated LM, because it clearly fits unnecessary terms for a large number of genes.

What is the best way to proceed?

Thanks!

-- output of sessionInfo():

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] sv_SE.UTF-8/sv_SE.UTF-8/sv_SE.UTF-8/C/sv_SE.UTF-8/sv_SE.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] edgeR_3.4.2 limma_3.18.9

loaded via a namespace (and not attached):
[1] tools_3.0.2

--
Sent via the guest posting facility at bioconductor.org.
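For readers unfamiliar with the AIC weights mentioned in the question: they are commonly defined, for a set of candidate models fitted to the same gene, as w_i = exp(-d_i/2) / sum_j exp(-d_j/2), where d_i is the AIC of model i minus the smallest AIC in the set. The following is a minimal sketch of that standard formula in Python, not the poster's actual code; the function name and the AIC values are made up for illustration.

```python
import math

def akaike_weights(aic_values):
    """Akaike weights: w_i = exp(-d_i / 2) / sum_j exp(-d_j / 2),
    where d_i = AIC_i - min(AIC) over the candidate models."""
    best = min(aic_values)
    # Relative likelihood of each model compared with the best one
    rel = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(rel)
    return [r / total for r in rel]

# Hypothetical AIC values for three candidate models of a single gene
weights = akaike_weights([100.0, 102.0, 110.0])
for i, w in enumerate(weights):
    print(f"model {i}: weight {w:.3f}")
```

The weights sum to 1 across the candidate set, so they are only meaningful relative to the models that were actually compared, which is why the 2^400 possible models are a problem for this approach.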


Updated 11.8 years ago by Gordon Smyth 53k • written 11.8 years ago by Guest User ★ 13k

score 0 · Answer 1 · 2014-03-07


Gordon Smyth 53k

@gordon-smyth


WEHI, Melbourne, Australia

Dear Francesco,

If you have 400 covariates and 1000 samples, it would appear that you can feasibly use all covariates in a linear model at once. Does voom() work on your computer with this full model, or does R run out of memory?

If it works, then I would suggest running voom and limma on the full model as usual, then removing covariates one by one from the linear model (without re-running voom) if they result in no DE genes. In model selection theory, this is called "backward selection".

Best wishes
Gordon

> Date: Thu, 6 Mar 2014 00:45:55 -0800 (PST)
> From: "Francesco [guest]" <guest at bioconductor.org>
> To: bioconductor at r-project.org, gatto at chalmers.se
> Subject: [BioC] Select different linear models in voom

11.8 years ago • Gordon Smyth 53k


Thanks for your reply. I can certainly run the regression on the 400 covariates. However, in my understanding, backward selection still requires ~2^400 steps (I honestly doubt that the covariates have independent effects on the response variable). Is there a more efficient way to proceed? And when you suggest removing a covariate if it results in no DE genes, do you mean that, for every gene, the coefficient of the corresponding factor is not significantly different from 0?

Best regards,

/Francesco

11.8 years ago • Francesco Gatto ▴ 20


On Fri, 7 Mar 2014, Francesco Gatto wrote:

> Thanks for your reply. I can certainly run the regression on the 400
> covariates.

I assume this means you can run voom(), lmFit() and eBayes().

> However, in my understanding, backward selection still requires ~2^400
> steps

It takes at most 400 steps. There are at most 400 covariates to remove. At each step you remove the covariate that produces the least DE amongst those remaining. At each step, one call to summary(decideTests()) will identify which one to remove.

> (I honestly doubt that covariates have an independent effect on the
> response variable).

Independence is not being assumed.

> Is there a more efficient way to operate? And when you suggest to remove
> a covariate if it results in no DE genes, do you mean that the
> coefficient of the corresponding factor is not significantly different
> from 0 for all the genes?

Yes, but you can obviously choose a less stringent criterion for removal if that suits your problem. It's your data -- I've just made a suggestion.

Gordon
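The step count in this reply follows from the shape of the loop: each iteration removes exactly one covariate, so there can be no more iterations than covariates. Below is a minimal, language-agnostic sketch in Python of that greedy loop. It is not limma code. The `count_de` callback is a hypothetical stand-in for refitting the reduced model and tallying DE genes per covariate (in limma, the reply suggests summary(decideTests()) for this), and the covariate names and DE counts are invented.

```python
def backward_eliminate(covariates, count_de, min_de=1):
    """Greedy backward selection: repeatedly drop the covariate with the
    fewest DE genes until every remaining one has at least `min_de`."""
    remaining = list(covariates)
    dropped = []
    while remaining:
        n_de = count_de(remaining)  # one model fit per step
        weakest = min(remaining, key=lambda c: n_de[c])
        if n_de[weakest] >= min_de:
            break  # every remaining covariate passes the criterion
        remaining.remove(weakest)
        dropped.append(weakest)
    return remaining, dropped

# Toy stand-in for the real fit: fixed, made-up DE counts per covariate.
# In a real analysis the counts would change after each removal.
TOY_DE = {"age": 120, "sex": 45, "batch": 0, "rin": 3, "site": 0}

def toy_count_de(remaining):
    return {c: TOY_DE[c] for c in remaining}

kept, dropped = backward_eliminate(TOY_DE, toy_count_de)
print("kept:", kept)
print("dropped:", dropped)
```

With 400 covariates the loop body runs at most 400 times, one fit per iteration, rather than once per subset of covariates. Raising `min_de` corresponds to the less stringent removal criterion mentioned in the reply.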

11.8 years ago • Gordon Smyth 53k