Hello All,

I have two fundamental problems with linear models (lm); maybe you can
help me clarify these issues:

1. Irrespective of how many factors you use in your experiment, the
relationship is always assumed to be linear. If you have a response
vector Y and a vector X of independent variables, then Y ~ X basically
assumes a straight line (with some slope). If you fit, say, Y ~ X + Z,
then one can think of the lm as a *flat* surface. The same is true for
higher dimensions (Y ~ dose + time + batch + gender + ...).

I think this assumption is really dangerous, since many
treatment/response relationships are not linear. For example, consider
an experiment in which cell cultures are treated with 5 doses of a drug:
0.0mM, 0.10mM, 0.25mM, 0.5mM and 1.0mM. The 0.1mM dose causes hardly any
change in gene expression, whereas there is a big difference in gene
expression at 0.25mM. At 0.5mM and 1.0mM the response is not much
stronger than at 0.25mM.

If one looks at a single gene, its expression goes up quite strongly
from 0.1mM to 0.25mM and then flattens out at the higher doses: the
response reaches saturation. Other responses look more like a logistic
curve. This is a typical scenario.

The problem is that within one experiment many genes behave as described
above, while others change linearly, others exponentially ...

Could I still use lm for this kind of experiment? Would I have to decide
on a gene-by-gene basis?

2. Some factors in an experiment, such as treatment (T), can take only
2 distinct values, say treated (t) and untreated (ut). Does a model such
as Y ~ T make any sense in this case?

Doesn't this assume a linear relationship between just 2 "clouds" of
data (assume there are many samples for each factor level)? Even if one
can clearly distinguish between t and ut, assuming a straight line may
be wrong. This is like drawing a straight line between two points. Just
as in my example above with the different doses, you may already have
reached some kind of saturation. Using such a model for prediction would
then give wrong results.

However, if one just wants to distinguish between t and ut, would the lm
be a valid method?

I'm reading some "beginners" literature about lm's, and I'm just trying
to understand what's going on ...

Maybe you could comment on this. I'd be very interested in any
explanation or clarification.

kind regards,
Arne
--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com
The linear model fit here is not what you think. Since we are using
factors, this is an analysis of variance model, so there is no
assumption of linearity per se. In other words, we are not testing to
see if there is a linear relationship between say, treatment and no
treatment. Instead what we are testing is to see if there is a
difference in the mean expression of each gene at the two (or more)
factor levels.
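
The point above can be sketched in R on simulated data. Everything here is an assumption made for illustration: the single "gene", its saturating means, the noise level and the four replicates per dose are invented, and only base R is used. Treating dose as a factor gives one fitted mean per dose level, whereas treating it as a number forces one straight line through all doses.

```r
# Simulated single-gene example; all numbers are invented for illustration.
set.seed(1)
dose <- rep(c(0, 0.1, 0.25, 0.5, 1.0), each = 4)     # mM, 4 replicates per dose
true_mean <- rep(c(5, 5.1, 8, 8.3, 8.4), each = 4)   # saturating response
expr <- true_mean + rnorm(length(dose), sd = 0.3)

# Dose as a factor: one mean per dose level (one-way ANOVA model).
fit_factor <- lm(expr ~ factor(dose))

# Dose as a number: a single straight line through all doses.
fit_line <- lm(expr ~ dose)

group_means    <- tapply(expr, dose, mean)
fitted_by_dose <- tapply(fitted(fit_factor), dose, mean)

# The factor model reproduces the observed group means.
print(all.equal(as.numeric(group_means), as.numeric(fitted_by_dose)))

# Residual sums of squares: the straight line can never fit better than
# the factor model, because the line is a special case of it.
print(c(factor = deviance(fit_factor), line = deviance(fit_line)))
```

The factor fit makes no claim about the shape of the dose-response curve; it only estimates a separate mean at each dose.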

So if you are testing the five different treatment levels you mention,
you are really testing to see if the mean expression level for each gene
is the same at all levels or not. If they are not, you then have to fit
contrasts to see where they differ. You can also fit different contrasts
to see if, say, the mean expression is the same at 0 mM and 0.1 mM, but
then changes at 0.25 mM (here you would be comparing the mean expression
of the 0 mM and 0.1 mM samples to the 0.25 mM samples).
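
One way to compute such a contrast by hand in base R is sketched below. This is not necessarily how it would be done in practice (dedicated contrast functions exist in several packages); the data are simulated and the weights in `w` are one assumed reading of "0 mM and 0.1 mM versus 0.25 mM", namely the 0.25 mM mean minus the average of the other two means.

```r
# Simulated data; all values are invented for illustration.
set.seed(2)
lev <- c("0", "0.1", "0.25", "0.5", "1")
dose_f <- factor(rep(lev, each = 4), levels = lev)
expr <- rep(c(5, 5.1, 8, 8.3, 8.4), each = 4) + rnorm(20, sd = 0.3)

# Cell-means parameterisation: no intercept, so each coefficient
# is the mean expression at one dose.
fit <- lm(expr ~ 0 + dose_f)

# Contrast weights: 0.25 mM mean minus the average of the
# 0 mM and 0.1 mM means; the two highest doses get weight 0.
w <- c(-0.5, -0.5, 1, 0, 0)

estimate <- sum(w * coef(fit))
std_err  <- sqrt(drop(t(w) %*% vcov(fit) %*% w))
t_stat   <- estimate / std_err
p_value  <- 2 * pt(-abs(t_stat), df = df.residual(fit))

print(c(estimate = estimate, t = t_stat, p = p_value))
```

The test uses the residual variance pooled over all five doses, which is why the contrast is taken from the full model rather than from a separate two-group comparison.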
If the book(s) you are reading cover ANOVA, you should take a look at
those sections, especially the parts about design matrices and
contrasts.
HTH,
Jim
James W. MacDonald
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
>>> <arne.muller@aventis.com> 03/04/04 01:48PM >>>
_______________________________________________
Bioconductor mailing list
Bioconductor@stat.math.ethz.ch
https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
Dear Arne,
If you declare your factors to be factors, e.g. with as.factor(x), then
lm creates indicator variables, which allow a different mean for each
treatment. These means will not lie on a straight line. So we are not
assuming linearity in the sense you discuss below.
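
A minimal sketch of those indicator variables for the two-level treatment case, using base R's `model.matrix()`. The six expression values are invented, and the choice of "ut" as the reference level is an assumption made here for readability.

```r
# Six hypothetical samples: three untreated (ut), three treated (t).
treatment <- c("ut", "ut", "ut", "t", "t", "t")
trt <- factor(treatment, levels = c("ut", "t"))  # "ut" is the reference level

# The design matrix lm builds for a factor: an intercept column plus
# a 0/1 indicator column for the non-reference level.
X <- model.matrix(~ trt)
print(X)

# Invented expression values for one gene.
y <- c(5.0, 5.2, 4.9, 7.1, 6.8, 7.0)
fit <- lm(y ~ trt)

# Intercept = mean of the untreated samples;
# "trtt"    = treated mean minus untreated mean.
print(coef(fit))
```

With only two levels the model simply estimates two group means, so Y ~ T amounts to a comparison of those means rather than a line used for interpolation between them.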

The "linear" in linear models does not indicate that the data vary
around a line. It indicates that the estimated effects are linear
functions of the dependent variable: if you multiply all of your
response values by the same constant, the estimated effects are
multiplied by the same constant. The t- and F-tests are therefore
independent of the measurement units. If you are using the log of the
data, this means that your tests of statistical significance will not
depend on whether you use log2, log10 or natural log.
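
The invariance to the log base can be checked on simulated data. In this sketch the intensities, group sizes and log-normal parameters are all invented; the only fact relied on is that log10(x) equals log2(x) times the constant log10(2).

```r
# Simulated raw intensities for two groups; values are invented.
set.seed(4)
group <- factor(rep(c("ut", "t"), each = 5), levels = c("ut", "t"))
intensity <- c(rlnorm(5, meanlog = 5), rlnorm(5, meanlog = 6))

fit_log2  <- lm(log2(intensity)  ~ group)
fit_log10 <- lm(log10(intensity) ~ group)

t_log2  <- summary(fit_log2)$coefficients[, "t value"]
t_log10 <- summary(fit_log10)$coefficients[, "t value"]

# The estimates differ by the constant factor log10(2) ...
print(coef(fit_log10) / coef(fit_log2))

# ... but the t statistics (and hence the p-values) are the same.
print(all.equal(t_log2, t_log10))
```

The same holds for the F statistic, since numerator and denominator are both scaled by the square of the constant.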
--Naomi Altman
Naomi S. Altman                      814-865-3791 (voice)
Associate Professor
Bioinformatics Consulting Center
Dept. of Statistics                  814-863-7114 (fax)
Penn State University                814-865-1348 (Statistics)
University Park, PA 16802-2111
Hello,

thanks for your reply. This clarifies the situation a bit. In terms of
ANOVA this makes a lot more sense!

Nevertheless, if you create an lm in R, you can apply summary() or
anova() to it, and they give you different p-values. I was wondering
what the difference is. Does summary() give the p-values for the
coefficients?

In addition, the anova is based on the lm. If the relationship between
the factor levels is not linear, does it matter?
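
To make the question concrete, here is a small sketch of the two calls side by side on simulated data (all values invented). For a factor with five levels, summary() reports one t-test per coefficient, each comparing a dose with the reference level, while anova() reports a single F-test for the factor as a whole.

```r
# Simulated data; all values are invented for illustration.
set.seed(5)
lev <- c("0", "0.1", "0.25", "0.5", "1")
dose_f <- factor(rep(lev, each = 4), levels = lev)
expr <- rep(c(5, 5.1, 8, 8.3, 8.4), each = 4) + rnorm(20, sd = 0.3)

fit <- lm(expr ~ dose_f)

# summary(): one row per coefficient (intercept plus four
# differences from the reference dose), each with its own t-test.
coef_table <- summary(fit)$coefficients
print(coef_table)

# anova(): one row per model term, here a single F-test on 4
# degrees of freedom asking whether any dose mean differs.
anova_table <- anova(fit)
print(anova_table)
```

With a single term in the model, the overall F statistic printed at the bottom of summary() is the same as the F value in the anova() table.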
kind regards,
Arne
PS: please let me know if you think this discussion is getting too far
off topic, i.e. too much stats rather than BioC.
> -----Original Message-----
> From: James MacDonald [mailto:jmacdon@med.umich.edu]
> Sent: 04 March 2004 21:28
> To: Muller, Arne PH/FR; bioconductor@stat.math.ethz.ch
> Subject: Re: [BioC] when do linear models work?