Question: Limma Model Design Question
0
4.3 years ago by
United Kingdom
andrew.j.skelton73320 wrote:

Hi,

I have an experiment that contains 3 treatments and a technical effect consisting of three batches.

treatment <- c("a","a","a","b","b","b","c","c","c")

batch <-c("i","ii","iii","i","ii","iii","i","ii","iii")

I realise that looks like a paired sample design, it's just for simplicity.

design <- model.matrix(~0 + as.factor(treatment) + as.factor(batch))

When I run the above code, I get a design matrix with 5 column names, all three of my treatments, then only two of my batches. I'm not sure why this is happening, I assumed that because I've added zero at the beginning of my model design, that stops terms from being absorbed into the intercept, can anyone shed any light on why I'm only seeing 5 terms?

I realise I could add in the batch as an "extra" variable using:

design <- model.matrix(~0 + as.factor(treatment), as.factor(batch))

but ultimately, I'd like to include multiple variables that could account for technical variation.

Cheers

limma limma design matrix • 1.0k views
modified 4.3 years ago by James W. MacDonald51k • written 4.3 years ago by andrew.j.skelton73320
6
4.3 years ago by
United States
James W. MacDonald51k wrote:

The other batch is absorbed in to the treatments. You can think of the coefficients in your model as 'adjustments' that make things comparable. The treatments are the means of each treatment for batch 'i', and the batch coefficients are the difference between say, batch ii and batch i, which then adjusts the data for batch ii (by subtracting out the mean difference between batch ii and batch i) so that you can compare data between batches.

Algebraically, for a given gene in batch i, you are saying

Gene expression = mean(gene expression, batch i) + error

and for the same gene in batch ii you are saying

Gene expression = mean(gene expression, batch i) + mean(batch ii - batch i) + error

Where you capture (and remove) the mean difference between the two batches in the second coefficient, to allow for inter-batch comparisons. The same thing is true for batch iii.

Edit: This doesn't necessarily allow for inter-batch comparisons. Instead, it is 'controlling for the batch effect', thus allowing you to make comparisons between treatments without having to worry that the batch effect is biasing your results.

1

Indeed, if your design were to have all three batch terms and all three treatments, it would not be of full rank:

> bad.design <- cbind(model.matrix(~0+treatment), model.matrix(~0+batch))
[1] "treatmenta" "treatmentb" "treatmentc" "batchi"     "batchii"
[6] "batchiii"
[1] 5

There are linear dependencies between your columns in this design matrix. More specifically, the row sums of the first 3 columns in bad.design gives the same vector (all-ones) as the row sums of the last 3 columns. This means that you'd have infinite solutions for the fitted value vector; you could just increase the estimates of the first 3 coefficients by any value, so long as you decrease the estimates of the last three by the same value. Obviously, this is not ideal, as we want one solution.

James, Aaron, thanks for your explanations. That helps clear things up. So if I were to make contrasts and do an lm fit, as per below, that should look at 'a' against 'b', but control for the batch effect?

design   <- model.matrix(~0 + as.factor(treatment) + as.factor(batch))
fit      <- lmFit(normalised_data, design)
cont_mat <- makeContrasts(a-b)
fit2     <- contrasts.fit(fit, contrasts=cont_mat)
fit2     <- eBayes(fit2)
1

That is correct.

The gist of your code is correct. However, you should set levels=design in the call to makeContrasts, and rename the first three columns of design to c('a','b','c').