Hi All
I am trying to find differentially expressed genes with edgeR, while simultaneously accounting for batch effects. Unfortunately, the batches completely encompass the samples, so that if I specify the batches and the samples in my edgeR design matrix, the system is over-specified (i.e. there are infinite optimizations of the GLM, I get the "Design matrix not of full rank" error). My data represent a time course where I have 2 different ages of mice, 3 cell types, and 7 time points (2*3*7=42). Each of these was done in two biological replicates, so I have 42*2=84 distinct replicates. I want to consider each mouse a batch, so there are 2*2 (replicates * old/young) different batches. Actually, a few samples are missing below, but this shouldn't affect anything.
> runData
id Batch Group
1 young_LT_0_1 young_BR1 young_LT_0
2 young_LT_0_2 young_BR2 young_LT_0
3 young_LT_0.5_1 young_BR1 young_LT_0.5
4 young_LT_0.5_2 young_BR2 young_LT_0.5
5 young_LT_1_1 young_BR1 young_LT_1
6 young_LT_1_2 young_BR2 young_LT_1
7 young_LT_12_2 young_BR2 young_LT_12
8 old_LT_12_2 old_BR2 old_LT_12
9 young_LT_2_1 young_BR1 young_LT_2
10 young_LT_2_2 young_BR2 young_LT_2
11 old_LT_2_2 old_BR2 old_LT_2
12 young_LT_4_1 young_BR1 young_LT_4
13 young_LT_4_2 young_BR2 young_LT_4
14 young_LT_8_1 young_BR1 young_LT_8
15 young_LT_8_2 young_BR2 young_LT_8
16 young_MPP_0_1 young_BR1 young_MPP_0
17 young_MPP_0_2 young_BR2 young_MPP_0
18 old_MPP_0_2 old_BR2 old_MPP_0
19 young_MPP_0.5_1 young_BR1 young_MPP_0.5
20 young_MPP_0.5_2 young_BR2 young_MPP_0.5
21 old_MPP_0.5_2 old_BR2 old_MPP_0.5
22 young_MPP_1_1 young_BR1 young_MPP_1
23 young_MPP_1_2 young_BR2 young_MPP_1
24 old_MPP_1_2 old_BR2 old_MPP_1
25 young_MPP_12_1 young_BR1 young_MPP_12
26 young_MPP_12_2 young_BR2 young_MPP_12
27 old_MPP_12_2 old_BR2 old_MPP_12
28 young_MPP_2_1 young_BR1 young_MPP_2
29 young_MPP_2_2 young_BR2 young_MPP_2
30 old_MPP_2_2 old_BR2 old_MPP_2
31 young_MPP_4_1 young_BR1 young_MPP_4
32 young_MPP_4_2 young_BR2 young_MPP_4
33 old_MPP_4_2 old_BR2 old_MPP_4
34 young_MPP_8_1 young_BR1 young_MPP_8
35 young_MPP_8_2 young_BR2 young_MPP_8
36 old_MPP_8_2 old_BR2 old_MPP_8
37 young_ST_0_1 young_BR1 young_ST_0
38 young_ST_0_2 young_BR2 young_ST_0
39 old_ST_0_2 old_BR2 old_ST_0
40 young_ST_0.5_1 young_BR1 young_ST_0.5
41 young_ST_0.5_2 young_BR2 young_ST_0.5
42 old_ST_0.5_2 old_BR2 old_ST_0.5
43 young_ST_1_2 young_BR2 young_ST_1
44 old_ST_1_2 old_BR2 old_ST_1
45 young_ST_12_1 young_BR1 young_ST_12
46 young_ST_12_2 young_BR2 young_ST_12
47 old_ST_12_2 old_BR2 old_ST_12
48 young_ST_2_1 young_BR1 young_ST_2
49 young_ST_2_2 young_BR2 young_ST_2
50 old_ST_2_2 old_BR2 old_ST_2
51 young_ST_4_1 young_BR1 young_ST_4
52 young_ST_4_2 young_BR2 young_ST_4
53 old_ST_4_2 old_BR2 old_ST_4
54 young_ST_8_1 young_BR1 young_ST_8
55 young_ST_8_2 young_BR2 young_ST_8
56 old_ST_8_2 old_BR2 old_ST_8
57 old_LT_4_1 old_BR1 old_LT_4
58 old_LT_8_1 old_BR1 old_LT_8
59 old_LT_12_1 old_BR1 old_LT_12
60 old_LT_0_1 old_BR1 old_LT_0
61 old_LT_0.5_1 old_BR1 old_LT_0.5
62 old_LT_1_1 old_BR1 old_LT_1
63 old_LT_2_1 old_BR1 old_LT_2
64 old_MPP_4_1 old_BR1 old_MPP_4
65 old_MPP_8_1 old_BR1 old_MPP_8
66 old_MPP_12_1 old_BR1 old_MPP_12
67 old_MPP_0_1 old_BR1 old_MPP_0
68 old_MPP_0.5_1 old_BR1 old_MPP_0.5
69 old_MPP_1_1 old_BR1 old_MPP_1
70 old_MPP_2_1 old_BR1 old_MPP_2
71 old_ST_4_1 old_BR1 old_ST_4
72 old_ST_8_1 old_BR1 old_ST_8
73 old_ST_12_1 old_BR1 old_ST_12
74 old_ST_0_1 old_BR1 old_ST_0
75 old_ST_0.5_1 old_BR1 old_ST_0.5
76 old_ST_1_1 old_BR1 old_ST_1
77 old_ST_2_1 old_BR1 old_ST_2
If I specify all batches and all groups in the design matrix, it is over-specified even without an intercept because I can infer the value of the last group by the value of other groups and batches (e.g. if I specify old_BR1 and old_BR2 batches and the values of the all but one of the old_*_* groups, the left out group is a linear combination of these factors). Accordingly, I specify the design matrix as follows:
G= runData$Group;
B= runData$Batch;
design <- cbind(model.matrix( ~0+B ),model.matrix( ~0+G ))
design = design[,!grepl("Gyoung_MPP_0$",colnames(design)) & !grepl("Gold_MPP_0$",colnames(design))]
Here, I leave out two of the group variables (one from old and one from young), which allows me to fit the GLM:
fit <- glmFit(dge,design)
However, once I have this fit, I am at a loss for how I can compare things to two groups that were left out of the design matrix (e.g. I want a contrast that is:
old_MPP_0vs0.5 = Gold_MPP_0.5 - Gold_MPP_0,
, but Gold_MPP_0 is not in the fitted model).
Alternatively, I could leave out two of the batches (e.g. old_BR1 and young_BR1) to make the system solvable, but then I don't think young and old time points will be comparable since this difference was not modeled.
So how can I contrast variables that are not explicitly modeled in the design matrix?
Thanks!
-Carl

I have partly solved the problem. If I want the following contrast:
but
Gyoung_MPP_0is not represented in the design matrix, I can just doI still don't know how to compare two variables that are both not represented in the design matrix, but will update if I figure it out.
I'm afraid that this workaround will still give incorrect results. See my answer to your post.