Question

EdgeR:Calculating differential expression within/between groups in nested designs

0

Entering edit mode

Mauve • 0

@mauve-7320

Last seen 9.2 years ago

Norway

Hi,

I have an RNA-seq experiment that is similar to section 3.5 in the edgeR user’s guide, i.e. a nested design, and I have used this approach to analyze my own data. Briefly, the experiment involves 60 RNA samples, corresponding to two groups of bacterial strains (commensal (C) and disease-causing (D)); each group consists of 15 different strains; either strain induced (IND) and in the control (CTR) condition. I am now a bit uncertain about how to interpret the design matrix I am using for this experiment when calculating differential expression and would very much appreciate some help. The data frame I am using is as follows:

> data.frame(group, strain, condition)
   group strain condition
1   C      1       CTR
2   C      1       IND
3   C      2       CTR
4   C      2       IND
5   C      3       CTR
6   C      3       IND
7   C      4       CTR
8   C      4       IND
9   C      5       CTR
10 C      5       IND
11 C      6       CTR
12 C      6       IND
13 C      7       CTR
14 C      7       IND
15 C      8       CTR
16 C      8       IND
17 C      9       CTR
18 C      9       IND
19 C     10       CTR
20 C     10       IND
21 C     11       CTR
22 C     11       IND
23 C     12       CTR
24 C     12       IND
25 C     13       CTR
26 C     13       IND
27 C     14       CTR
28 C     14       IND
29 C     15       CTR
30 C     15       IND
31   D      1       CTR
32   D      1       IND
33   D      2       CTR
34   D      2       IND
35   D      3       CTR
36   D      3       IND
37   D      4       CTR
38   D      4       IND
39   D      5       CTR
40   D      5       IND
41   D      6       CTR
42   D      6       IND
43   D      7       CTR
44   D      7       IND
45   D      8       CTR
46   D      8       IND
47   D      9       CTR
48   D      9       IND
49   D     10       CTR
50   D     10       IND
51   D     11       CTR
52   D     11       IND
53   D     12       CTR
54   D     12       IND
55   D     13       CTR
56   D     13       IND
57   D     14       CTR
58   D     14       IND
59   D     15       CTR
60   D     15       IND

The design matrix (modeled after the example in 3.5 of the manual) has group as the main effect, with strain and condition nested within the groups:

design <- model.matrix(~group+group:strain+group:condition)
colnames(design)
[1] "(Intercept)"            "groupD"               "groupC:strain2"
[4] "groupD:strain2"       "groupC:strain3"      "groupD:strain3"
[7] "groupC:strain4"      "groupD:strain4"       "groupC:strain5"
[10] "groupD:strain5"       "groupC:strain6"      "groupD:strain6"
[13] "groupC:strain7"      "groupD:strain7"       "groupC:strain8"
[16] "groupD:strain8"       "groupC:strain9"      "groupD:strain9"
[19] "groupC:strain10"     "groupD:strain10"      "groupC:strain11"
[22] "groupD:strain11"      "groupC:strain12"     "groupD:strain12"
[25] "groupC:strain13"     "groupD:strain13"      "groupC:strain14"
[28] "groupD:strain14"      "groupC:strain15"     "groupD:strain15"
[31] "groupC:conditionIND" "groupD:conditionIND"

Fit GLM to each feature given calculated dispersions and design matrix:
fit <- glmFit(d, design)

Then calculate differential expression within groups using the following coefficients (please correct me if this is wrong):
1.   Genes responding to induction in any group:
       DE1 <- glmLRT(fit, coef=31:32)
2.   Genes responding to induction in group C:
       DE2 <- glmLRT(fit, coef=31)
3.   Genes responding to induction in group D:
       DE3 <- glmLRT(fit, coef=32)

I am uncertain about comparing between groups using this design matrix. If I understand correctly, according to the example in 3.5 I can compare between groups to find:.

4. Genes responding differently to induction in group D vs group C using this contrast:

DE4 <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,1))

Can I also use this design matrix to compare between groups to find (and if so, using which coefficients/contrasts):
5. Genes responding differently to the control condition in group D vs group C?
6. Genes responding differently to any condition in group D vs group C?
Or would I need to modify the design matrix/do a separate analysis to compare between groups in this case?

Any help would be much appreciated.

edger nested design matrix differential expression • 2.5k views

ADD COMMENT • link updated 9.2 years ago by Aaron Lun ★ 28k • written 9.2 years ago by Mauve • 0

score 1 · Accepted Answer · 2015-02-23

Your design looks good, though keep in mind that it assumes that a gene will have the same response to induction in all strains. As for your contrasts 1 - 4, they look sensible.

Your contrast 5 is a bit odd; I'm not sure what you mean by "respond differently", as it's a control sample, so what's it responding to? I'll assume that you're looking for DE genes between the control samples of the two groups. This is complicated by the blocking on strain in the design. Instead, you could try to find DE genes between groups within each strain. For example, the genes that have the differences in control expression across groups C and D in strain 2 can be discovered with:

DE5 <- glmLRT(fit, contrast=c(0,1,-1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))

You can then do this for all strains, taking the average log-fold change to get the overall difference between groups:

DE5.2 <- glmLRT(fit, contrast=c(0,15,-1,1,-1,1,-1,1,-1,1,-1,1,-1,1,-1,1,-1,1,-1,1,-1,1,-1,1,-1,1,-1,1,-1,1,0,0)/15)

The same logic can be applied for the induced condition. In the case of strain 2, we have:

DE6 <- glmLRT(fit, contrast=c(0,1,-1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,1))

and this can simply be averaged as before over all strains. You can then set up an ANOVA-like contrast by cbind'ing the contrasts for the control and induced conditions.

Of course, you could always specify another design matrix where there is no consideration of strain-specific expression in the control samples for each group. Were I to do this, I might use:

design2 <- model.matrix(~group + strain:condition:group)
design2 <- design2[,-(3:32)]

This assumes that control samples behave the same between strains within each group, but allows for strain-specific responses to induction in each group. This will simplify contrast 5 as you can just drop the group coefficient to compare control samples. However, it will complicate contrasts 1 - 4, as there is no longer any coefficient representing the induction effect for all strains.