Question: Design question limma multi variables
0
2.9 years ago by
Stane40
Stane40 wrote:

Hello,

I have been working on microarrays using R and limma for differential gene expression analysis. My current design is fairly simple: I have just two classes, "control" and "treatment", and I am only interested in the DE genes between control and treatment.

design <- model.matrix(~cell_class, data)


But the data also contain different cell lines (around 7) and treatment methods (4), so I have been wondering whether it would be better and more precise to use another design, like so:

design <- model.matrix(~cell_class + cell_lines + treatment_methods, data)


Both designs lead to very similar output when calling topTable: almost all DE genes are the same, but with differences in fold change. I have been searching the limma documentation; sections 9.5 and 9.7 of chapter 9 are quite close to my question but not exactly the same, as they seem to be interested in contrasts between subgroups, while I am more interested in the global control vs treatment comparison.
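As a toy illustration (hypothetical factor levels, not the poster's actual data), here is what the columns of the two designs look like in base R:

```r
# Made-up data: 2 classes, 2 cell lines, 2 treatment methods
toy <- data.frame(
  cell_class = factor(rep(c("control", "treatment"), 4)),
  cell_lines = factor(rep(c("A", "B"), each = 4)),
  treatment_methods = factor(rep(c("m1", "m1", "m2", "m2"), 2))
)

# Simple design: intercept plus one column for the treatment class
d1 <- model.matrix(~cell_class, toy)
colnames(d1)  # "(Intercept)" "cell_classtreatment"

# Additive design: extra columns absorb cell line / method differences
d2 <- model.matrix(~cell_class + cell_lines + treatment_methods, toy)
colnames(d2)  # adds "cell_linesB" and "treatment_methodsm2"
```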

ps: I am not posting the design matrix output as it is a little over 300 rows and would not be very useful.

limma design matrix • 1.5k views
modified 2.9 years ago by Aaron Lun25k • written 2.9 years ago by Stane40

If you have a control and 4 treatment methods, doesn't that make five classes, not 2?

Thanks, but I wouldn't ask if it were that simple. I am not interested in the slight differences between treatments; my interest is in the general picture of control vs treated cells (any treatment method).

Let's suppose I didn't mention treatment_methods and only added the cell_lines information to the second design. Would it be more accurate in this case compared to the first one, even if I am only using the cell_class binary factor (control/treated) as the coef in topTable?

This post would benefit from a small-scale example. Are all classes/treatments present in each cell line? Are the treatment types nested within the treatment class, or are there specific controls for each treatment method? Currently I am imagining your situation below, but I can't tell if this is actually the case, and it's difficult to give a precise answer without details:

treatment_method <- factor(rep(0:4, 7))          # Control is "treatment 0"
cell_lines <- factor(rep(LETTERS[1:7], each=5))  # Cell lines A-G, 5 samples each
cell_class <- factor(treatment_method == 0)      # TRUE = control

Thank you for taking the time, Aaron. You summed up the situation pretty well in your code. I just simplified and changed it a little to reflect the fact that I have at least 2 controls per cell line, some cell lines are more represented than others, and a few cell lines have more than one treatment method.

cell_lines <- factor(c(rep(LETTERS[1:4], each=4), rep('D', 4)))  # Cell lines A to D
treatment_method <- factor(c(0,0,1,1, 0,0,1,1, 0,0,2,2, 0,0,2,2, 0,0,1,1))  # Most cell lines get 1 treatment method; control is 0
cell_class <- factor(treatment_method == 0)  # the contrast I am mostly interested in
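Tabulating this toy layout (a base-R sketch of the example above) shows which treatment methods each cell line received and confirms at least two controls per line:

```r
cell_lines <- factor(c(rep(LETTERS[1:4], each = 4), rep("D", 4)))
treatment_method <- factor(c(0,0,1,1, 0,0,1,1, 0,0,2,2, 0,0,2,2, 0,0,1,1))
cell_class <- factor(treatment_method == 0)  # TRUE = control sample

# Samples per cell line and treatment method:
# A and B get method 1, C gets method 2, D gets both (plus 4 controls)
table(cell_lines, treatment_method)
```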

Answer: Design question limma multi variables
0
2.9 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

This is the design I would use:

group <- factor(paste0(cell_lines, treatment_method))
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)

To test for an average effect of the treatment, I would set up a contrast like this:

con <- makeContrasts(A1 - A0, B1 - B0, C2 - C0, (D1 + D2)/2 - D0,
levels=design)
con <- rowMeans(con)

... which will test whether the average treatment effect across all cell lines and treatment methods is significantly different from zero.

The above approach has a number of advantages over the additive model. For starters, it is simpler to interpret, and you can perform comparisons between any groups. It allows for different treatment effects in different cell lines. (Presumably, whoever designed the experiment thought that these were potentially relevant factors, otherwise they wouldn't have bothered to spend money testing them.) In contrast, the additive model assumes that the treatment effect is constant across all cell lines; if this is not true, the variance will be inflated and your detection power for DE genes will be reduced.

An additional point is that your design matrix for the additive model is not of full rank (you should have gotten a warning from lmFit saying that some coefficients were not estimable). This means that the coefficients require some care in their interpretation; they are likely not to be what you think they are. In my tests with your toy example (see your comment, and using the additive formula in your original post), it seems that treatment_method2 is not estimated. This means that the cell_classTRUE coefficient actually represents the log-fold change of control over treatment 2, while the treatment_method1 coefficient represents the log-fold change of treatment 1 over treatment 2. So, dropping cell_classTRUE wouldn't be testing for any global effect of treatment.
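The rank deficiency is easy to verify on the toy example from the comments, using only base R (a sketch, independent of any expression data): because cell_classTRUE plus the two treatment_method columns always sum to the intercept, qr() reports one fewer estimable coefficient than there are columns.

```r
cell_lines <- factor(c(rep(LETTERS[1:4], each = 4), rep("D", 4)))
treatment_method <- factor(c(0,0,1,1, 0,0,1,1, 0,0,2,2, 0,0,2,2, 0,0,1,1))
cell_class <- factor(treatment_method == 0)

design <- model.matrix(~cell_class + cell_lines + treatment_method)
ncol(design)     # 7 columns...
qr(design)$rank  # ...but rank 6: one coefficient is not estimable
```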

Thank you so much, Aaron, I am starting to understand :) Concerning the toy example additive model, are you saying that cell_classTRUE captures the control vs treatment 2 comparison only because its rows with a 1 line up with treatment_method2 rather than treatment_method1? In my additive design with all the data, I noticed that the rows with a 1 in cell_classTRUE alternate almost like the 1s in each treatment.

Also, I just tested a contrast based on your example, and I am getting different output compared to my very simple design:

design <- model.matrix(~cell_class)
fit <- lmFit(gse, design)
fit <- eBayes(fit)
tT <- topTable(fit, coef = "cell_classTRUE",
adjust="fdr", sort.by="logFC", number=Inf)

New design

group <- factor(paste0(cell_lines, treatment_method))
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)
con <- makeContrasts(A1 - A0, B1 - B0, C2 - C0, (D1 + D2)/2 - D0,
levels=design)
con <- rowMeans(con)
fit <- lmFit(gse, design)
fit2 <- contrasts.fit(fit, con)
fit2 <- eBayes(fit2)
tT <- topTable(fit2, adjust="fdr", sort.by="logFC", number=Inf)

Should I forget the old one and keep the new one? And is it con <- rowMeans(con) which makes the contrast focus on all treatments vs all controls?

To answer your first question: the coefficients of the additive model

design <- model.matrix(~cell_class + cell_lines + treatment_method)

... are not interpretable as-is due to linear dependencies between cell_classTRUE, treatment_method1 and treatment_method2. If you want to interpret each coefficient, you have to discard one of them. limma chooses (more-or-less arbitrarily, in this case) to discard treatment_method2 for you, as lmFit doesn't try to estimate it.

Now, consider samples for cell line D. (I've chosen this cell line because it has both treatment conditions, which makes my point easy to demonstrate.) We describe the systematic component for each type of sample (i.e., control/treatment 1/treatment 2), based on whether the coefficient has a value of 1 or 0 in the design matrix for the row corresponding to that type of sample:

Control = Intercept + cell_classTRUE + cell_linesD
Treatment_1 = Intercept + treatment_method1 + cell_linesD
Treatment_2 = Intercept + cell_linesD

It should be fairly obvious that cell_classTRUE represents the difference (i.e., the log-fold change) between control and treatment 2, while treatment_method1 represents the difference between treatment 1 and treatment 2.
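One way to see this directly, sketched in base R on the toy example (the column drop below emulates lmFit leaving treatment_method2 unestimated): subtracting the design row of a treatment-2 sample in cell line D from the row of a control sample in cell line D leaves a 1 only in the cell_classTRUE position, so that coefficient is the control-minus-treatment-2 difference.

```r
cell_lines <- factor(c(rep(LETTERS[1:4], each = 4), rep("D", 4)))
treatment_method <- factor(c(0,0,1,1, 0,0,1,1, 0,0,2,2, 0,0,2,2, 0,0,1,1))
cell_class <- factor(treatment_method == 0)

design <- model.matrix(~cell_class + cell_lines + treatment_method)
design <- design[, colnames(design) != "treatment_method2"]  # mimic the dropped coefficient

# Cell line D: sample 13 is a control, sample 15 received treatment 2
design[13, ] - design[15, ]
# Only cell_classTRUE differs between the two rows
```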

To answer your second question; probably yes, because your old model (using only cell_class) is likely to be flawed. I don't know what treatment methods you're using, but I find it hard to believe that different cell lines would not have systematic differences in behaviour. These differences will inflate the variance estimates in your old model.

To answer your third question; try to figure out what the values of the contrast vector mean. I get:

    A0     A1     B0     B1     C0     C2     D0     D1     D2
-0.250  0.250 -0.250  0.250 -0.250  0.250 -0.250  0.125  0.125


... which can also be generated from this:

con <- makeContrasts((A1 + B1 + C2 + (D1+D2)/2)/4 - (A0 + B0 + C0 + D0)/4,
levels=design)

In other words, this contrast compares the average of all treatment samples to the average of all control samples. I've averaged the two different treatment types in cell line D so that each cell line contributes the same number of coefficients to the contrast.
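This equivalence can also be checked numerically without limma, by writing the contrast vectors out by hand over the nine group coefficients (the contr() helper below is hypothetical, just emulating what makeContrasts would produce):

```r
groups <- c("A0","A1","B0","B1","C0","C2","D0","D1","D2")

# Hypothetical helper: build one contrast vector from named weights
contr <- function(...) {
  w <- c(...)
  out <- setNames(numeric(length(groups)), groups)
  out[names(w)] <- w
  out
}

# Per-cell-line treatment effects, then averaged across cell lines
percell <- cbind(contr(A1 = 1, A0 = -1),
                 contr(B1 = 1, B0 = -1),
                 contr(C2 = 1, C0 = -1),
                 contr(D1 = 0.5, D2 = 0.5, D0 = -1))
averaged <- rowMeans(percell)

# Single "average treatment minus average control" contrast
direct <- contr(A1 = 0.25, B1 = 0.25, C2 = 0.25, D1 = 0.125, D2 = 0.125,
                A0 = -0.25, B0 = -0.25, C0 = -0.25, D0 = -0.25)

all.equal(averaged, direct)  # TRUE
```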

Thank you so much for your explanation, you really are a great help. I understood the second part concerning the contrast: it is a set of additional coefficients that contrasts.fit applies to the related columns of the design. And the following contrast, which I find clearer:

con <- makeContrasts((A1 + B1 + C2 + (D1+D2)/2)/4 - (A0 + B0 + C0 + D0)/4,
levels=design) 

is equivalent to

con <- makeContrasts(A1 - A0, B1 - B0, C2 - C0, (D1 + D2)/2 - D0,
levels=design)
con <- rowMeans(con)

However, concerning the additive design matrix: are you saying it is not interpretable because there is more than one 1 per row? I searched the code and found that lmFit calls, in cascade:

MASS::rlm -> stats::lm.wfit -> C_Cdqrls (C) -> dqrls (Fortran) to solve the linear model with a least-squares solution

So from the documentation and the code, I understood that lmFit finds and fits a linear model for each gene using the design coefficients and the gene expression values. But I am still confused: how do you compute the systematic components?

Control = Intercept + cell_classTRUE + cell_linesD
Treatment_1 = Intercept + treatment_method1 + cell_linesD
Treatment_2 = Intercept + cell_linesD

Is the + in your formula a binary operator? I thought that cell_classTRUE was the control because of:

cell_class <- factor(treatment_method == 0)

And what does Treatment_1/2 mean? Is it not the same as treatment_method1/2?

I feel like I am missing something obvious and I don't want to make you lose too much time on this; I am going to try to catch up by reading the limma papers :)