Question: Setting up the Design Matrix
gravatar for Yogender Gowtham
3.9 years ago by
Yogender Gowtham20 wrote:

Hello All



I am trying to use edgeR to identify differentially expressed genes in my experiment involving three factors - 2 cell lines (Say A and B), 2 temperatures (t1 and t2) and 2 pH levels (p1 and p2). Each condition was conducted in duplicates and hence there are 16 samples in all. Since each of the factors could influence the outcome, I wanted to include all of them in my design matrix. 

The questions I would like to answer are:

1) see the effect of temperature between t1 and t2, irrespective of the cell lines. 

2) See the effect of pH between p1 and p2, irrespective of the cell lines.

3) Understand the differences between the cell lines. 

4) Obtain the 2-way and 3-way interaction effects between the factors. 

My code for Design Matrix is like this (Code is in bold):

celltype <- factor(samples$ct, levels = c("A", "B"))
temp <- factor(samples$trt2, levels = c("t1", "t2"))
pH <- factor (samples$trt1, levels = c( "p1", "p2"))
design <- model.matrix(~celltype+temp+pH)
rownames(design) <- samples$file

                                 (Intercept) celltypeB tempt2 pHP2
Sample 1(B_t2_p2_1)              1            1       1       1
Sample2(B_t2_p2_2)             1            1       1       1
Sample3(B_t2_p1_1)            1            1       1       0
Sample4(B_t2_p1_2)            1            1       1       0
Sample5(A_t2_p2_1)             1            0       1       1
Sample6(A_t2_p2_2)             1            0       1       1
Sample7(A_t2_p1_1)           1            0       1       0
Sample8(A_t2_p1_2)           1            0       1       0
Sample9(B_t1_p1_1)            1            1       0       0
Sample10(B_t1_p1_2)            1            1       0       0
Sample11(B_t1_p2_1)              1            1       0       1
Sample12(B_t1_p2_2)              1            1       0       1
Sample13(A_t1_p2_1)             1            0       0       1
Sample14(A_t1_p2_2)             1            0       0       1
Sample15(A_t1_p1_1)           1            0       0       0
Sample16(A_t1_p1_2)           1            0       0       0

[1] 0 1 2 3
[1] "contr.treatment"

[1] "contr.treatment"

[1] "contr.treatment"

cds <- estimateGLMCommonDisp(cds, design, verbose=T) 
#Disp = 0.01078 , BCV = 0.1038 

cds2 <- estimateGLMTrendedDisp(cds, design) 
cds2 <- estimateGLMTagwiseDisp(cds2, design)

glmfit <- glmFit(cds2, design)

de <- glmLRT(glmfit, coef = 3)
tt <- topTags(de, n=nrow(cds2))


If I understand correctly, coef=3 should give me the effect of temperature - irrespective of cell lines and pH. Is my design correct? And will it be sufficient to answer my questions? 




ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Yogender Gowtham20
Answer: Setting up the Design Matrix
gravatar for Yunshun Chen
3.9 years ago by
Yunshun Chen540
Yunshun Chen540 wrote:

Your design looks correct and testing on coef=3 gives you the overall temperature effect. The first three questions can be answered by using this design. However, an interactive model (design) is required in order to test for the interaction effects between factors.


ADD COMMENTlink written 3.9 years ago by Yunshun Chen540

Thank you Dr. Chen. Glad to know my first design is correct. I am not entirely sure on how to code for a three-way interaction model. 

design <- model.matrix(~celltype*temp*pH)

Will this model allow me to perform? This is the output I got when I tried the above model: 



  (Intercept) celltypeB tempT2 pHP2 celltypeB:tempT2 celltypeB:pHP2 tempT2:pHP2 celltypeB:tempT2:pHP2
Sample1 1 1 1 1 1 1 1 1
Sample2 1 1 1 1 1 1 1 1
Sample3 1 1 1 0 1 0 0 0
Sample4 1 1 1 0 1 0 0 0
Sample5 1 0 1 1 0 0 1 0
Sample6 1 0 1 1 0 0 1 0
Sample7 1 0 1 0 0 0 0 0
Sample8 1 0 1 0 0 0 0 0
Sample9 1 1 0 0 0 0 0 0
Sample10 1 1 0 0 0 0 0 0
Sample11 1 1 0 1 0 1 0 0
Sample12 1 1 0 1 0 1 0 0
Sample13 1 0 0 1 0 0 0 0
Sample14 1 0 0 1 0 0 0 0
Sample15 1 0 0 0 0 0 0 0
Sample16 1 0 0 0 0 0 0 0

Please advise. Thanks


ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Yogender Gowtham20

That's correct, though a simpler (but equivalent) approach is to reparametrize it as a one-way layout:

grouping <- factor(paste0(celltype, temp, pH, sep="."))
design <- model.matrix(~0 + grouping)
colnames(design) <- levels(grouping)

You can then test for DE between any combinations of factors by specifying the groups in makeContrasts.

ADD REPLYlink written 3.9 years ago by Aaron Lun25k

Thanks Aaron. Yes, I actually made a model with simple design (exactly like you mentioned) but since my experiment has multiple factors that could affect the gene expression, I wanted to include them in the design.

Anders 2013

Design matrix. For more complex designs (i.e., beyond two-group comparisons), users need to provide a design matrix that specifies the factors that are expected to affect expression levels. As mentioned above, GLMs can be used to analyze arbitrarily complex experiments, and the design matrix is the means by which the experimental design is described mathematically, including both biological factors of interest and other factors not of direct interest, such as batch effects. For example, Section 4.5 of the edgeR User’s Guide (‘RNA-seq of pathogen inoculated Arabidopsis with batch effects’) ... presents worked case studies with batch effects. The design matrix is central for such complex differential expression analyses, and users may wish to consult a linear modeling textbook39 or a local statistician to make sure their design matrix is appropriately specified. 

The difference we have seen is in the overestimation of differentially expressed genes. A contrast, similar to coef=3, indicated around 8000 genes to be affected by temperature.  We are just looking to reduce the type I errors. 

ADD REPLYlink written 3.9 years ago by Yogender Gowtham20

The interaction model and one-way layout are different parametrizations of the same linear system. I would be very surprised if they gave different estimates or p-values at any stage of the DE analysis. The more likely explanation is that the contrast you've set up for the one-way layout is not equivalent to dropping coef=3 in the interaction model. See my comments on interpreting main effects in the presence of interaction terms.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Aaron Lun25k

It is quite possible that my contrast was setup wrongly with the simpler one-way layout. 

I saw your comments on how interpreting coef=3 as main effect can be misleading. So I suppose there is no way of knowing the main effect given the presence of interaction terms. So coef=3 gives me the DGE due to temperature differences in cell type A. How can I find the differences in cell type B? I did see in the watershed example, you had contrasts between group 2 and group 4 to find differences between RS and RL. If I transfer the idea here, would that mean 

con1 <- (0,0,1,0,1,0,0,0) will give me differences due to temperature in samples 3 and 4 in comparison to 9,10?  


Con2 <- (0,1,0,0,0,0,0,0) will give differences due to cell type in samples 9,10 wrt samples 15 and 16?


ADD REPLYlink written 3.9 years ago by Yogender Gowtham20

You're right. The two contrasts give you the results for the two comparisons of your interest.

It is still possible to test for the main effect of temperature under the interaction model. However, the contrast is not as intuitive as what you would have using the one-way layout. The contrast would be c(0,0,1,0,0.5,0,0.5,0.25) under the interaction model.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Yunshun Chen540

Hello Yushun, could you please elaborate how the above contrast to find the main effect of temperature work? 


ADD REPLYlink written 3.9 years ago by Yogender Gowtham20

Think of it this way; samples 1-8 are in temperature T2, and samples 9-16 are in the other temperature. You want to take the average of the first 8 samples against the average of the last 8 samples to get the temperature effect, averaged across all other factors. You can do that by doing:

colMeans(design[1:8,]) - colMeans(design[9:16,])

... which gives you the same contrast vector as Andy has described. Note that this is an average across samples, rather than across groups; they're the same here, but there will be a difference in other situations where you don't have the same number of replicates in each group. I usually use an average across groups but the average across samples is easier to explain here (and is the same, so no harm done).

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Aaron Lun25k

Hello Yushun, could you please elaborate how the above contrast to find the main effect of temperature work? 


ADD REPLYlink written 3.9 years ago by Yogender Gowtham20

Also, with the earlier design when I input: de <- glmLRT(glmfit, coef = 3), I got 7108 genes to be differentially expressed with FDR<0.01.

However with the interaction model, the same input: de <- glmLRT(glmfit, coef = 3), I get 5280 genes to be differentially expressed with FDR<0.01. Could you please help me understand the differences and which model is statistically better to implement? 


ADD REPLYlink written 3.9 years ago by Yogender Gowtham20

Check out the discussion following this post:

Different results in edgeR using simple vs GLM

Basically, in your interaction model, coef=3 no longer represents the overall temperature effect. Rather, it represents the effect of temperature in samples 7 and 8 relative to samples 15 and 16. For all other samples, dropping the third coefficient will have no effect, as any DE due to temperature will get soaked up by the interaction terms. The problem here is that you can only test main effects once you know that the interaction terms are not significant (and once those terms are removed from the model).

In the context of your analysis, you were implicitly assuming that the interaction terms were non-significant when you fitted an additive model; that's why you can test for main effects. This will result in more DE genes as you're using information from all samples. In contrast, when you drop coef=3 in the interaction model, you're only testing for DE between samples 7-8 and 15-16. Fewer samples means less power (you also have fewer residual d.f. to estimate the dispersion) and less DE.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Aaron Lun25k

Thank you once again. The discussion in that post was informative. I do now understand the differences in estimating the differentially expressed genes was due to higher degrees of freedom available for the earlier model mentioned where I am just looking at the main effects. Now, through the latter model I have established all interaction terms, except for the interaction of cell type and temperature, are not significant. Can I develop a model that will provide me with the main effects of temperature, pH and cell line while allowing to look at the cell type and temperature interaction alone? From the discussion, I gather that when you include the interactions terms, then the main effects can no longer be interpreted. So what can I do?

Also, I am not sure if I understand why the model represents only samples 7,8, 15 and 16 when the coef=3 in the latter model? Is this dependent on what is being represented by the intercept? 


ADD REPLYlink written 3.9 years ago by Yogender Gowtham20

I'll answer your second question first; figure out what samples have identical rows when you drop the third coefficient. That defines the null hypothesis for your test. In this case, samples 7, 8, 15 and 16 are all the same if you ignore the third coefficient. Thus, the third coefficient represents the difference between 7/8 and 15/16.

As for your first question; in more traditional applications, people would fit the full model, test the interaction terms, toss them out if they weren't significant, and repeat with the reduced model until you got to the main effects. Unfortunately, this is harder to do here because we're fitting a model for every gene. If the interaction terms are significant for some genes, do we toss them out and refit the model, or leave them in? It's hard to say, especially given that tossing out significant terms would result in an inflated dispersion for a gene. This would result in higher dispersions for all genes via empirical Bayes shrinkage, potentially resulting in loss of power for all genes in the entire data set.

In such cases, I usually fit a model with all relevant interaction terms. I then change the contrasts to get something equivalent to a test for a main effect. You can do something similar to Andy's Yunshun's example, which effectively tests for DE between the average of all T2 samples against the average of all non-T2 samples. Alternatively, you can do an ANOVA-like comparison, e.g., testing for any changes between B.T2 - B.T1 or between A.T2 - A.T1. You can then select significant genes where the temperature-induced DE is in the same direction for both cell types.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Aaron Lun25k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 168 users visited in the last hour