Entering edit mode
Dear Michael,
Thanks for your comments on limma and for your email. I definitely
agree
with you that constructing the design and contrasts matrices are the
hardest part of using limma - the amount of time I spend advising
people on
this makes me very aware of it. You're not the only person to have
asked
about factorial models, so I'll try to address some general issues.
Firstly, let me say that the factorial example in the User's Guide is
out
of data. The analysis would be rather simpler now using newer features
of
the designMatrix() and contrast.fit() functions. I won't go through
the
newer analysis here because the Weaver experiment is a direct design
while
you have a common reference design - a bit simpler.
Secondly, why doesn't limma allow you to use factorial formula like
'data ~
c*b*t' to construct your design matrix? There are a number of reasons.
a) ANOVA is not as simple as it appears. In you were to use the aov()
function in the data example that you give below, you would not get
out the
interactions and F-statistics that you expect. There just isn't enough
data
to use the factorial model approach. Even if there was more data,
microarray data is usually unbalanced because of quality weights or
missing
data so there won't be a unique anova table.
b) The formula aov() approach doesn't apply without further
development to
direct comparison two-color designs or to designs including dye-swaps
where
the dye-swap is not a factor to be fitted.
c) The most important reason though relates to the interpretation of
the
factorial model. The division of sums of squares into mean effects and
interactions is a generally useful technique but the interactions very
often do not correspond to contrasts of interest in themselves. A
factorial
ANOVA analysis usually proceeds in several steps. One first looks at
the
interactions to get an idea of the complexity of the data. If there
aren't
any interactions, then one can interpret the main effects directly. If
there are, then one has to look at the data in more detail to
determine
which specific contrasts are causing the interactions. I can't see how
one
can realistically go through this process for each of 20,000 ANOVA
tables.
I think that it is best to instead specify the contrasts of interest
in
advance. Unfortunately this means that you will need to do something
different for each analysis - it is hard to automate it.
Thirdly, note that if you really do have a formula which corresponds
to a
model that you want to fit, you can always use it in limma. Just use
design <- model.matrix( formula )
and then input that design matrix to lmFit().
In your three way factorial example, I really think that you may want
to
neglect the interactions between animal and infection & time, i.e.,
you
want a model like
data ~ animal + infection*time
However you need to know what all the fitted parameters mean, which
you
probably won't if you just let aov() or model.matrix() make a design
matrix
for you! Here is a suggestion. Suppose you have a targets file like
this:
> targets <- readTargets("temp.txt")
> targets
Cy3 Cy5
1 Reference Animal1UninfEarly
2 Reference Animal1UninfLate
3 Reference Animal1InfEarly
4 Reference Animla1InfLate
5 Reference Animal2UninfEarly
6 Reference Animal2UninfLate
7 Reference Animal2InfEarly
8 Reference Animal2InfLate
Now specify what you'd like the parameters in your linear model to
mean:
> parameters <- cbind(
+ UninfEarly=c(-1,0.5,0,0,0,0.5,0,0,0),
+ UninfLate=c(-1,0,0.5,0,0,0,0.5,0,0,),
+ InfEarly=c(-1,0,0,0.5,0,0,0,0.5,0),
+ InfLate=c(-1,0,0,0,0.5,0,0,0,0.5,),
+ AnimalDiff=c(0,-0.25,-0.25,-0.25,-0.25,0.25,0.25,0.25,0.25)
+ )
> rownames(parameters) <- c("Reference",targets$Cy5)
> parameters
UninfEarly UninfLate InfEarly InfLate AnimalDiff
Reference -1.0 -1.0 -1.0 -1.0 0.00
Animal1UninfEarly 0.5 0.0 0.0 0.0 -0.25
Animal1UninfLate 0.0 0.5 0.0 0.0 -0.25
Animal1InfEarly 0.0 0.0 0.5 0.0 -0.25
Animla1InfLate 0.0 0.0 0.0 0.5 -0.25
Animal2UninfEarly 0.5 0.0 0.0 0.0 0.25
Animal2UninfLate 0.0 0.5 0.0 0.0 0.25
Animal2InfEarly 0.0 0.0 0.5 0.0 0.25
Animal2InfLate 0.0 0.0 0.0 0.5 0.25
This means you want five coefficients. The first corresponds to the
average
log-ratio of uninfected animals at the early time versus the common
reference. The second is the average log-ratio of uninfected animals
at the
later time, etc. The fifth coefficient represents the difference
between
your two animals, a nuisance parameter probably. There are other ways
to do
this, the coefficients I've chosen are only one choice.
Now compute the design matrix. You don't need to try to understand the
detailed entries in the design matrix to interpret the coefficients.
> design <- designMatrix(targets,parameters)
Found unique target names:
Reference Animal1UninfEarly Animal1UninfLate Animal1InfEarly
Animla1InfLate Animal2UninfEarly Animal2UninfLate Animal2InfEarly
Animal2InfLate
Warning message:
number of parameters should be one less than number of targets in:
designMatrix(targets, parameters)
> design
UninfEarly UninfLate InfEarly InfLate AnimalDiff
1 1 0 0 0 -0.5
2 0 1 0 0 -0.5
3 0 0 1 0 -0.5
4 0 0 0 1 -0.5
5 1 0 0 0 0.5
6 0 1 0 0 0.5
7 0 0 1 0 0.5
8 0 0 0 1 0.5
> fit <- lmFit(MA, design)
Now you can start asking quesions. For example, what is the effect of
infection at the early time? What is the effect of infection at the
late
time? Does the effect of infection change between the late and early
times?
(This is usually called interaction.)
cont.matrix <-
cbind(InfvsUninfEarly=c(-1,0,1,0,0),InfvsUninfLate=c(0,-1,0,1,0),Early
vsLateEffect=c(1,-1,-1,1,0))
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
Note that you need a recent version of limma for this to work.
Regards
Gordon
At 03:42 AM 28/02/2004, michael watson (IAH-C) wrote:
>Hi
>
>First off, let me say that i think limma is a quite brilliant package
and
>I use it a lot. However, one of the biggest obstructions to using
limma
>for the lay biologist is an inability to understand the design
>matrix. Although there is a lot of documentation, showing the design
>matrix for a number of example problems, there is no discussion as to
how
>that design matrix was constructed i.e. the logical thought processes
that
>went into it.
>
>For the two-sample experiment given in the UserGuide, I understand
that
>there must be one row per array in my design matrix and the columns
>represent the coefficients I want to calculate. I represent the
>differences in the factors with 1's and 0's. Great, this is pretty
>similar to how I do it for aov().
>
>But then, all of a sudden, for the factorial experiment the design
matrix
>not only has -1's in there, but also a column for the interactions.
How
>do I decide which array/factor combination gets a 1, a 0 or a -1?
>
>Let me put this in perspective. I have a 3 factor experiment where
the
>factors are animal, infected/uninfected and time. All samples were
>hybridised against a common reference. For analysis of variance, all
I do
>is set up a data.frame that looks like this for each gene:
>
> data c b t
>1 2.9 1 1 1
>2 2.7 1 0 2
>3 2.8 1 1 1
>4 3.0 1 0 2
>5 -3.0 0 1 1
>6 -3.5 0 0 2
>7 -4.0 0 1 1
>8 -5.0 0 0 2
>
>where data is my data, and c, b and t are my factors, and then feed
in
>something like:
>
>(aov.aov <- aov(data ~ c*b*t, aov.data))
>
>and I get F-statistics for c, b and t and all possible interactions.
>
>Because of the limitations of analysis of variance for my microarray
data,
>I would like to use limma. Is there any *more* documentation I can
look
>at that will tell me the steps to take to work out what my limma
design
>matrix will look like?
>
>Kind regards
>
>Michael