Question

limma design matrix

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 9 hours ago

WEHI, Melbourne, Australia

Dear Michael, Thanks for your comments on limma and for your email. I definitely agree with you that constructing the design and contrasts matrices are the hardest part of using limma - the amount of time I spend advising people on this makes me very aware of it. You're not the only person to have asked about factorial models, so I'll try to address some general issues. Firstly, let me say that the factorial example in the User's Guide is out of data. The analysis would be rather simpler now using newer features of the designMatrix() and contrast.fit() functions. I won't go through the newer analysis here because the Weaver experiment is a direct design while you have a common reference design - a bit simpler. Secondly, why doesn't limma allow you to use factorial formula like 'data ~ c*b*t' to construct your design matrix? There are a number of reasons. a) ANOVA is not as simple as it appears. In you were to use the aov() function in the data example that you give below, you would not get out the interactions and F-statistics that you expect. There just isn't enough data to use the factorial model approach. Even if there was more data, microarray data is usually unbalanced because of quality weights or missing data so there won't be a unique anova table. b) The formula aov() approach doesn't apply without further development to direct comparison two-color designs or to designs including dye-swaps where the dye-swap is not a factor to be fitted. c) The most important reason though relates to the interpretation of the factorial model. The division of sums of squares into mean effects and interactions is a generally useful technique but the interactions very often do not correspond to contrasts of interest in themselves. A factorial ANOVA analysis usually proceeds in several steps. One first looks at the interactions to get an idea of the complexity of the data. If there aren't any interactions, then one can interpret the main effects directly. If there are, then one has to look at the data in more detail to determine which specific contrasts are causing the interactions. I can't see how one can realistically go through this process for each of 20,000 ANOVA tables. I think that it is best to instead specify the contrasts of interest in advance. Unfortunately this means that you will need to do something different for each analysis - it is hard to automate it. Thirdly, note that if you really do have a formula which corresponds to a model that you want to fit, you can always use it in limma. Just use design <- model.matrix( formula ) and then input that design matrix to lmFit(). In your three way factorial example, I really think that you may want to neglect the interactions between animal and infection & time, i.e., you want a model like data ~ animal + infection*time However you need to know what all the fitted parameters mean, which you probably won't if you just let aov() or model.matrix() make a design matrix for you! Here is a suggestion. Suppose you have a targets file like this: > targets <- readTargets("temp.txt") > targets Cy3 Cy5 1 Reference Animal1UninfEarly 2 Reference Animal1UninfLate 3 Reference Animal1InfEarly 4 Reference Animla1InfLate 5 Reference Animal2UninfEarly 6 Reference Animal2UninfLate 7 Reference Animal2InfEarly 8 Reference Animal2InfLate Now specify what you'd like the parameters in your linear model to mean: > parameters <- cbind( + UninfEarly=c(-1,0.5,0,0,0,0.5,0,0,0), + UninfLate=c(-1,0,0.5,0,0,0,0.5,0,0,), + InfEarly=c(-1,0,0,0.5,0,0,0,0.5,0), + InfLate=c(-1,0,0,0,0.5,0,0,0,0.5,), + AnimalDiff=c(0,-0.25,-0.25,-0.25,-0.25,0.25,0.25,0.25,0.25) + ) > rownames(parameters) <- c("Reference",targets$Cy5) > parameters UninfEarly UninfLate InfEarly InfLate AnimalDiff Reference -1.0 -1.0 -1.0 -1.0 0.00 Animal1UninfEarly 0.5 0.0 0.0 0.0 -0.25 Animal1UninfLate 0.0 0.5 0.0 0.0 -0.25 Animal1InfEarly 0.0 0.0 0.5 0.0 -0.25 Animla1InfLate 0.0 0.0 0.0 0.5 -0.25 Animal2UninfEarly 0.5 0.0 0.0 0.0 0.25 Animal2UninfLate 0.0 0.5 0.0 0.0 0.25 Animal2InfEarly 0.0 0.0 0.5 0.0 0.25 Animal2InfLate 0.0 0.0 0.0 0.5 0.25 This means you want five coefficients. The first corresponds to the average log-ratio of uninfected animals at the early time versus the common reference. The second is the average log-ratio of uninfected animals at the later time, etc. The fifth coefficient represents the difference between your two animals, a nuisance parameter probably. There are other ways to do this, the coefficients I've chosen are only one choice. Now compute the design matrix. You don't need to try to understand the detailed entries in the design matrix to interpret the coefficients. > design <- designMatrix(targets,parameters) Found unique target names: Reference Animal1UninfEarly Animal1UninfLate Animal1InfEarly Animla1InfLate Animal2UninfEarly Animal2UninfLate Animal2InfEarly Animal2InfLate Warning message: number of parameters should be one less than number of targets in: designMatrix(targets, parameters) > design UninfEarly UninfLate InfEarly InfLate AnimalDiff 1 1 0 0 0 -0.5 2 0 1 0 0 -0.5 3 0 0 1 0 -0.5 4 0 0 0 1 -0.5 5 1 0 0 0 0.5 6 0 1 0 0 0.5 7 0 0 1 0 0.5 8 0 0 0 1 0.5 > fit <- lmFit(MA, design) Now you can start asking quesions. For example, what is the effect of infection at the early time? What is the effect of infection at the late time? Does the effect of infection change between the late and early times? (This is usually called interaction.) cont.matrix <- cbind(InfvsUninfEarly=c(-1,0,1,0,0),InfvsUninfLate=c(0,-1,0,1,0),Early vsLateEffect=c(1,-1,-1,1,0)) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) Note that you need a recent version of limma for this to work. Regards Gordon At 03:42 AM 28/02/2004, michael watson (IAH-C) wrote: >Hi > >First off, let me say that i think limma is a quite brilliant package and >I use it a lot. However, one of the biggest obstructions to using limma >for the lay biologist is an inability to understand the design >matrix. Although there is a lot of documentation, showing the design >matrix for a number of example problems, there is no discussion as to how >that design matrix was constructed i.e. the logical thought processes that >went into it. > >For the two-sample experiment given in the UserGuide, I understand that >there must be one row per array in my design matrix and the columns >represent the coefficients I want to calculate. I represent the >differences in the factors with 1's and 0's. Great, this is pretty >similar to how I do it for aov(). > >But then, all of a sudden, for the factorial experiment the design matrix >not only has -1's in there, but also a column for the interactions. How >do I decide which array/factor combination gets a 1, a 0 or a -1? > >Let me put this in perspective. I have a 3 factor experiment where the >factors are animal, infected/uninfected and time. All samples were >hybridised against a common reference. For analysis of variance, all I do >is set up a data.frame that looks like this for each gene: > > data c b t >1 2.9 1 1 1 >2 2.7 1 0 2 >3 2.8 1 1 1 >4 3.0 1 0 2 >5 -3.0 0 1 1 >6 -3.5 0 0 2 >7 -4.0 0 1 1 >8 -5.0 0 0 2 > >where data is my data, and c, b and t are my factors, and then feed in >something like: > >(aov.aov <- aov(data ~ c*b*t, aov.data)) > >and I get F-statistics for c, b and t and all possible interactions. > >Because of the limitations of analysis of variance for my microarray data, >I would like to use limma. Is there any *more* documentation I can look >at that will tell me the steps to take to work out what my limma design >matrix will look like? > >Kind regards > >Michael

GO limma PROcess weaver GO limma PROcess weaver • 2.6k views

ADD COMMENT • link 21.0 years ago Gordon Smyth 52k