choosing explanatory variables for linear model in limma
@andre-j-aberer-3659
Dear list members,

Short version of my question: how can I determine whether introducing additional explanatory variables improves the quality of a linear model (in limma)? Is there an equivalent to feature selection (as in machine learning) for choosing the explanatory variables?

The complete story: we analyse a dataset of about ninety single-channel microarray chips, and we want to search for differentially expressed genes and enriched gene sets. The chips are annotated with information (at least 20 factors, possibly extended to 50) such as the organ from which the RNA was extracted, the experimenter who did the lab work, the labelling kit she used, and a large number of features describing, e.g., the genotype of the individual or different aspects of the disease.

We would like to build one linear model (i.e. one design matrix) with all of the factors of interest mentioned above as explanatory variables, in order to test various contrasts. Of course, we have to include all the variables that we might possibly want to test in the linear model. But what about "technical" factors like the "labelling kit" that was used? One might never want to test a contrast on this explanatory variable, yet the net chip intensity could well be influenced by such a technical factor. So how can I determine whether it makes sense to include this variable?

I am using the standard procedure as described in the limma guide:

    designMatrix <- model.matrix(~0 + var1 + var2, data=someTable)
    fitBoth <- lmFit(eset, designMatrix)

where var1 and var2 are variables like "diseaseOutcome" and "labellingKit".

We thought that an ANOVA table might help us here, showing the influence of var1 and var2. As far as I have read (e.g. http://data.princeton.edu/R/linearModels.html), the anova function can simply be applied to an lm object, or can be used to compare two lm instances. Of course, in that case it is applied to only one linear model, not one per gene as in the limma setting. If I try anova on one or two limma fit objects (MArrayLM), R complains that there is no applicable method, and other anova variants (like anova.lm) do not work either. The same holds when I try an ANOVA on a single linear model extracted for one gene (like anova(fitBoth[1,])).

Our last resort so far is to build one design matrix with, and another without, a certain explanatory variable. We would then determine the top DEGs and, for each DEG, compare its two fitted linear models in an ANOVA table. Finally, we would check for how many of the top DEGs the additional variable makes a difference. However, this does not seem to be the golden path... or are we completely on the wrong track?

Best regards,
Andre J. Aberer
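For the gene-wise comparison of two candidate designs described above, limma's selectModel() scores alternative design matrices per gene with an information criterion. The following is a minimal sketch, assuming eset, someTable, var1 and var2 are the objects from the question:

    library(limma)
    library(Biobase)

    ## candidate designs: with and without the technical factor (var2)
    designWith    <- model.matrix(~ 0 + var1 + var2, data = someTable)
    designWithout <- model.matrix(~ 0 + var1,        data = someTable)

    ## per-gene model choice by BIC ("aic" is also available)
    sm <- selectModel(exprs(eset),
                      designlist = list(with = designWith, without = designWithout),
                      criterion  = "bic")

    ## how many genes prefer each design?
    table(sm$pref)

A design preferred for the bulk of genes would be a reasonable one to carry forward into lmFit(); this is only a sketch of the idea, not a recipe from the original thread.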
Microarray limma • 1.4k views
@james-w-macdonald-5106
Hi Andre,

If you want to do model selection, then limma is probably not the tool for the job. Instead, what I would do is choose some genes (one, five, ten, whatever) and use lm() for the model selection process. That way you can go through all the conventional model selection steps, and once you are satisfied with the model you have chosen you can go back to limma and fit it to all the genes.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
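One way to follow this suggestion, sketched under the assumption that eset, someTable, var1 and var2 are the objects from the question (the gene index here is arbitrary):

    library(Biobase)

    ## expression values for a single gene across the ~90 arrays
    y <- exprs(eset)[1, ]

    ## nested ordinary linear models: without and with the technical factor
    fitSmall <- lm(y ~ 0 + var1,        data = someTable)
    fitFull  <- lm(y ~ 0 + var1 + var2, data = someTable)

    ## classical nested F-test: does var2 explain significant extra variance?
    anova(fitSmall, fitFull)

    ## the usual lm machinery also applies, e.g.
    drop1(fitFull, test = "F")

Repeating this for a handful of genes gives a feel for whether the technical factor belongs in the design before refitting everything with lmFit().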