Hi,
I have an RNA-seq experiment that is similar to section 3.5 in the edgeR user’s guide, i.e. a nested design, and I have used this approach to analyze my own data. Briefly, the experiment involves 60 RNA samples, corresponding to two groups of bacterial strains (commensal (C) and disease-causing (D)); each group consists of 15 different strains; either strain induced (IND) and in the control (CTR) condition. I am now a bit uncertain about how to interpret the design matrix I am using for this experiment when calculating differential expression and would very much appreciate some help. The data frame I am using is as follows:
> data.frame(group, strain, condition)
group strain condition
1 C 1 CTR
2 C 1 IND
3 C 2 CTR
4 C 2 IND
5 C 3 CTR
6 C 3 IND
7 C 4 CTR
8 C 4 IND
9 C 5 CTR
10 C 5 IND
11 C 6 CTR
12 C 6 IND
13 C 7 CTR
14 C 7 IND
15 C 8 CTR
16 C 8 IND
17 C 9 CTR
18 C 9 IND
19 C 10 CTR
20 C 10 IND
21 C 11 CTR
22 C 11 IND
23 C 12 CTR
24 C 12 IND
25 C 13 CTR
26 C 13 IND
27 C 14 CTR
28 C 14 IND
29 C 15 CTR
30 C 15 IND
31 D 1 CTR
32 D 1 IND
33 D 2 CTR
34 D 2 IND
35 D 3 CTR
36 D 3 IND
37 D 4 CTR
38 D 4 IND
39 D 5 CTR
40 D 5 IND
41 D 6 CTR
42 D 6 IND
43 D 7 CTR
44 D 7 IND
45 D 8 CTR
46 D 8 IND
47 D 9 CTR
48 D 9 IND
49 D 10 CTR
50 D 10 IND
51 D 11 CTR
52 D 11 IND
53 D 12 CTR
54 D 12 IND
55 D 13 CTR
56 D 13 IND
57 D 14 CTR
58 D 14 IND
59 D 15 CTR
60 D 15 IND
The design matrix (modeled after the example in 3.5 of the manual) has group as the main effect, with strain and condition nested within the groups:
design <- model.matrix(~group+group:strain+group:condition)
colnames(design)
[1] "(Intercept)" "groupD" "groupC:strain2"
[4] "groupD:strain2" "groupC:strain3" "groupD:strain3"
[7] "groupC:strain4" "groupD:strain4" "groupC:strain5"
[10] "groupD:strain5" "groupC:strain6" "groupD:strain6"
[13] "groupC:strain7" "groupD:strain7" "groupC:strain8"
[16] "groupD:strain8" "groupC:strain9" "groupD:strain9"
[19] "groupC:strain10" "groupD:strain10" "groupC:strain11"
[22] "groupD:strain11" "groupC:strain12" "groupD:strain12"
[25] "groupC:strain13" "groupD:strain13" "groupC:strain14"
[28] "groupD:strain14" "groupC:strain15" "groupD:strain15"
[31] "groupC:conditionIND" "groupD:conditionIND"
Fit GLM to each feature given calculated dispersions and design matrix:
fit <- glmFit(d, design)
Then calculate differential expression within groups using the following coefficients (please correct me if this is wrong):
1. Genes responding to induction in any group:
DE1 <- glmLRT(fit, coef=31:32)
2. Genes responding to induction in group C:
DE2 <- glmLRT(fit, coef=31)
3. Genes responding to induction in group D:
DE3 <- glmLRT(fit, coef=32)
I am uncertain about comparing between groups using this design matrix. If I understand correctly, according to the example in 3.5 I can compare between groups to find:.
4. Genes responding differently to induction in group D vs group C using this contrast:
DE4 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,1))
Can I also use this design matrix to compare between groups to find (and if so, using which coefficients/contrasts):
5. Genes responding differently to the control condition in group D vs group C?
6. Genes responding differently to any condition in group D vs group C?
Or would I need to modify the design matrix/do a separate analysis to compare between groups in this case?
Any help would be much appreciated.