when do linear models work?


Arne.Muller@aventis.com ▴ 620


Hello All,

I have two fundamental problems with linear models (lm); maybe you can help me clarify these issues:

1. Irrespective of how many factors you use in your experiment, the relationship is always assumed to be linear. If you have a response vector Y and a vector X of independent variables, then Y ~ X basically assumes a straight line (with some kind of slope). If you fit, say, Y ~ X + Z, then one can think of the lm as a *flat* surface. The same is true for higher dimensions (Y ~ dose + time + batch + gender + ...).

I think this assumption is really dangerous, since many treatment/response relationships are not linear. For example, think about an experiment: I have 5 doses (0.0 mM, 0.10 mM, 0.25 mM, 0.5 mM and 1.0 mM) of a drug with which cell cultures get treated. The 0.1 mM dose causes hardly any change in gene expression, whereas there is a big difference in gene expression at 0.25 mM. Then at 0.5 mM and 1.0 mM the response is not much stronger than at 0.25 mM.

If one just looks at a single gene, then expression of this gene goes up quite strongly from 0.1 mM to 0.25 mM, and then expression flattens out for the higher doses. The response reaches saturation. Other responses are more like a logistic curve. This is a typical scenario.

The problem is that many genes within one experiment behave as described above, others change linearly, others exponentially ...

Could I still use lm for this kind of experiment? Would I have to decide on a gene-by-gene basis?

2. Some factors, such as treatment (T), can only take, say, 2 distinct values in an experiment: treated (t) and untreated (ut). Does a model such as Y ~ T make any sense in this case?

Doesn't this assume a linear relationship between just 2 "clouds" of data (assume there are many samples for each factor level)? Even if one can clearly distinguish between t and ut, assuming a straight line may be wrong. This is like drawing a straight line between two points. Just as in my example above with the different doses, you may have already reached some kind of saturation. Using such a model for prediction would then give wrong results.

However, if one just wants to distinguish between t and ut, would the lm be a valid method?

I'm reading some "beginners" literature about lm's, and I'm just trying to understand what's going on ...

Maybe you could comment on this. I'd be very interested in any explanation or clarification.

kind regards,

Arne

--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com


21.8 years ago • Arne.Muller@aventis.com


James W. MacDonald 68k


United States

The linear model fit here is not what you think. Since we are using factors, this is an analysis of variance model, so there is no assumption of linearity per se. In other words, we are not testing to see if there is a linear relationship between, say, treatment and no treatment. Instead, what we are testing is whether there is a difference in the mean expression of each gene at the two (or more) factor levels.

So if you are testing the five different treatment levels you mention, you are really testing whether the mean expression level for each gene is the same at all levels or not. If they are not, you then have to fit contrasts to see where they differ. You can also fit different contrasts to see if, say, the mean expression is the same at 0 mM and 0.1 mM, but then changes at 0.25 mM (here you would be comparing the mean expression of the 0 mM and 0.1 mM samples to the 0.25 mM samples).

If the book(s) you are reading cover ANOVA, you should take a look at those sections, especially the parts about design matrices and contrasts.

HTH,

Jim

James W. MacDonald
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623

>>> <arne.muller@aventis.com> 03/04/04 01:48PM >>>
Hello All, [...]
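As a rough illustration of the ANOVA view described in this answer, the following R sketch fits dose as a factor for a single simulated gene and then tests one contrast (the mean of the 0 mM and 0.1 mM samples against the 0.25 mM samples). The simulated group means, the noise level, the four replicates per dose and all variable names are assumptions made up for the example, not data from the thread.

```r
# Simulated expression for one gene (assumed values): flat at 0 and 0.1 mM,
# then saturated from 0.25 mM upwards.
set.seed(1)
dose_levels <- c(0, 0.1, 0.25, 0.5, 1.0)        # mM, the doses from the question
true_means  <- c(5, 5.1, 8, 8.2, 8.3)           # made-up group means
dose <- factor(rep(dose_levels, each = 4))      # 4 replicates per dose (assumed)
y    <- rnorm(length(dose), mean = rep(true_means, each = 4), sd = 0.3)

# Cell-means parameterisation: one coefficient per dose level, no slope.
fit <- lm(y ~ 0 + dose)
coef(fit)                    # the per-dose sample means
tapply(y, dose, mean)        # same numbers, computed directly

# Overall test: is the mean expression the same at all five doses?
anova(lm(y ~ dose))

# Contrast: mean of (0 mM, 0.1 mM) versus 0.25 mM.
k      <- c(-0.5, -0.5, 1, 0, 0)
est    <- sum(k * coef(fit))
se     <- sqrt(drop(t(k) %*% vcov(fit) %*% k))
t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df = df.residual(fit))
c(estimate = est, se = se, t = t_stat, p = p_val)
```

Because each dose gets its own coefficient, the fitted means are free to follow a saturating pattern; the contrast only asks where the means differ.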

21.8 years ago • James W. MacDonald


Naomi Altman ★ 6.0k


United States

Dear Arne,

If you declare your factors to be factors ("as.factor(x)"), then lm creates indicator variables, which allow a different mean for each treatment. These means will not lie on a straight line, so we are not assuming linearity in the sense you discuss below.

The "linear" in linear models does not indicate that the data vary around a line. It indicates that the estimated effects are linear functions of the dependent variable: if you multiply all of your response values by the same constant, the estimated effects are multiplied by the same constant. The t- and F-tests are therefore independent of the measurement units. If you are using the log of the data, it means that your tests of statistical significance will not depend on whether you use log2, log10 or natural log.

--Naomi Altman

At 01:48 PM 3/4/2004, Arne.Muller@aventis.com wrote:
>Hello All,
> [...]

Naomi S. Altman
Associate Professor, Dept. of Statistics
Bioinformatics Consulting Center
Penn State University
University Park, PA 16802-2111
814-865-3791 (voice)
814-863-7114 (fax)
814-865-1348 (Statistics)
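Both points in this reply can be checked with a small R sketch on simulated data. The simulated means, the replicate count and the variable names are assumptions for illustration only; the rescaling constant log2(10) is the factor that converts log10 values to log2 values.

```r
set.seed(2)
dose <- rep(c(0, 0.1, 0.25, 0.5, 1.0), each = 4)   # numeric doses in mM
y    <- rnorm(length(dose),
              mean = rep(c(5, 5.1, 8, 8.2, 8.3), each = 4), sd = 0.3)

fit_line   <- lm(y ~ dose)              # numeric dose: intercept + one slope
fit_factor <- lm(y ~ as.factor(dose))   # factor: indicator variables per level

length(coef(fit_line))     # 2 coefficients: a straight line
length(coef(fit_factor))   # 5 coefficients: one mean per dose

# The fitted values of the factor model are the group means,
# which need not lie on a line.
fitted_means <- tapply(fitted(fit_factor), dose, mean)
group_means  <- tapply(y, dose, mean)

# Linearity in the response: rescale y by a constant and refit.
const      <- log2(10)                  # log10 -> log2 conversion factor
y_scaled   <- const * y
fit_scaled <- lm(y_scaled ~ as.factor(dose))

# Estimates are multiplied by the constant ...
all.equal(coef(fit_scaled), const * coef(fit_factor))
# ... while t statistics and the overall F statistic are unchanged.
all.equal(summary(fit_scaled)$coefficients[, "t value"],
          summary(fit_factor)$coefficients[, "t value"])
all.equal(summary(fit_scaled)$fstatistic, summary(fit_factor)$fstatistic)
```

Comparing `fit_line` with `fit_factor` on the same data makes the distinction concrete: only the numeric coding imposes a straight line on the dose response.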

21.8 years ago • Naomi Altman


Arne.Muller@aventis.com ▴ 620


Hello,

thanks for your reply. This clarifies the situation a bit. In terms of ANOVA this makes a lot more sense!

Nevertheless, if you create an lm in R, you can apply summary() or anova(), which give you different p-values. I was wondering what the difference is: does summary() give the p-values for the coefficients? In addition, the anova is based on the lm; if the relationship between the factor levels is not linear, does it matter?

kind regards,

Arne

ps: please let me know if you think this discussion is getting too far off topic, i.e. too much stats rather than BioC.

> -----Original Message-----
> From: James MacDonald [mailto:jmacdon@med.umich.edu]
> Sent: 04 March 2004 21:28
> To: Muller, Arne PH/FR; bioconductor@stat.math.ethz.ch
> Subject: Re: [BioC] when do linear models work?
>
> [...]
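One way to explore the summary() versus anova() question is a small R sketch on simulated data. It assumes R's default treatment contrasts, under which summary() reports a t-test for each coefficient (each level compared with the baseline level) and anova() reports one F-test per model term. The data, replicate counts and variable names are made up for illustration.

```r
set.seed(3)

# Two-level treatment factor: untreated (ut) vs treated (t), 6 replicates each.
trt  <- factor(rep(c("ut", "t"), each = 6), levels = c("ut", "t"))
y2   <- rnorm(12, mean = rep(c(5, 7), each = 6), sd = 0.5)
fit2 <- lm(y2 ~ trt)

p_t <- summary(fit2)$coefficients["trtt", "Pr(>|t|)"]  # t-test: t minus ut
p_F <- anova(fit2)["trt", "Pr(>F)"]                    # F-test: whole term
all.equal(p_t, p_F)   # with two levels F = t^2, so the p-values agree

# Five dose levels: summary() gives four comparisons against the 0 mM baseline,
# anova() gives a single F-test for the dose term.
dose <- factor(rep(c(0, 0.1, 0.25, 0.5, 1.0), each = 4))
y5   <- rnorm(20, mean = rep(c(5, 5.1, 8, 8.2, 8.3), each = 4), sd = 0.3)
fit5 <- lm(y5 ~ dose)

summary(fit5)$coefficients   # intercept (0 mM mean) and each dose minus 0 mM
anova(fit5)                  # one row for 'dose' with 4 degrees of freedom
```

With a factor, each level has its own mean in both outputs, so a non-linear pattern across the levels does not invalidate either test; the two functions differ in which hypotheses they test, not in the model being fitted.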

21.8 years ago • Arne.Muller@aventis.com
