Hi All
I am trying to find differentially expressed genes with edgeR, while simultaneously accounting for batch effects. Unfortunately, the batches completely encompass the samples, so that if I specify the batches and the samples in my edgeR design matrix, the system is over-specified (i.e. there are infinite optimizations of the GLM, I get the "Design matrix not of full rank" error). My data represent a time course where I have 2 different ages of mice, 3 cell types, and 7 time points (2*3*7=42). Each of these was done in two biological replicates, so I have 42*2=84 distinct replicates. I want to consider each mouse a batch, so there are 2*2 (replicates * old/young) different batches. Actually, a few samples are missing below, but this shouldn't affect anything.
> runData id Batch Group 1 young_LT_0_1 young_BR1 young_LT_0 2 young_LT_0_2 young_BR2 young_LT_0 3 young_LT_0.5_1 young_BR1 young_LT_0.5 4 young_LT_0.5_2 young_BR2 young_LT_0.5 5 young_LT_1_1 young_BR1 young_LT_1 6 young_LT_1_2 young_BR2 young_LT_1 7 young_LT_12_2 young_BR2 young_LT_12 8 old_LT_12_2 old_BR2 old_LT_12 9 young_LT_2_1 young_BR1 young_LT_2 10 young_LT_2_2 young_BR2 young_LT_2 11 old_LT_2_2 old_BR2 old_LT_2 12 young_LT_4_1 young_BR1 young_LT_4 13 young_LT_4_2 young_BR2 young_LT_4 14 young_LT_8_1 young_BR1 young_LT_8 15 young_LT_8_2 young_BR2 young_LT_8 16 young_MPP_0_1 young_BR1 young_MPP_0 17 young_MPP_0_2 young_BR2 young_MPP_0 18 old_MPP_0_2 old_BR2 old_MPP_0 19 young_MPP_0.5_1 young_BR1 young_MPP_0.5 20 young_MPP_0.5_2 young_BR2 young_MPP_0.5 21 old_MPP_0.5_2 old_BR2 old_MPP_0.5 22 young_MPP_1_1 young_BR1 young_MPP_1 23 young_MPP_1_2 young_BR2 young_MPP_1 24 old_MPP_1_2 old_BR2 old_MPP_1 25 young_MPP_12_1 young_BR1 young_MPP_12 26 young_MPP_12_2 young_BR2 young_MPP_12 27 old_MPP_12_2 old_BR2 old_MPP_12 28 young_MPP_2_1 young_BR1 young_MPP_2 29 young_MPP_2_2 young_BR2 young_MPP_2 30 old_MPP_2_2 old_BR2 old_MPP_2 31 young_MPP_4_1 young_BR1 young_MPP_4 32 young_MPP_4_2 young_BR2 young_MPP_4 33 old_MPP_4_2 old_BR2 old_MPP_4 34 young_MPP_8_1 young_BR1 young_MPP_8 35 young_MPP_8_2 young_BR2 young_MPP_8 36 old_MPP_8_2 old_BR2 old_MPP_8 37 young_ST_0_1 young_BR1 young_ST_0 38 young_ST_0_2 young_BR2 young_ST_0 39 old_ST_0_2 old_BR2 old_ST_0 40 young_ST_0.5_1 young_BR1 young_ST_0.5 41 young_ST_0.5_2 young_BR2 young_ST_0.5 42 old_ST_0.5_2 old_BR2 old_ST_0.5 43 young_ST_1_2 young_BR2 young_ST_1 44 old_ST_1_2 old_BR2 old_ST_1 45 young_ST_12_1 young_BR1 young_ST_12 46 young_ST_12_2 young_BR2 young_ST_12 47 old_ST_12_2 old_BR2 old_ST_12 48 young_ST_2_1 young_BR1 young_ST_2 49 young_ST_2_2 young_BR2 young_ST_2 50 old_ST_2_2 old_BR2 old_ST_2 51 young_ST_4_1 young_BR1 young_ST_4 52 young_ST_4_2 young_BR2 young_ST_4 53 old_ST_4_2 old_BR2 old_ST_4 54 young_ST_8_1 young_BR1 young_ST_8 55 young_ST_8_2 young_BR2 young_ST_8 56 old_ST_8_2 old_BR2 old_ST_8 57 old_LT_4_1 old_BR1 old_LT_4 58 old_LT_8_1 old_BR1 old_LT_8 59 old_LT_12_1 old_BR1 old_LT_12 60 old_LT_0_1 old_BR1 old_LT_0 61 old_LT_0.5_1 old_BR1 old_LT_0.5 62 old_LT_1_1 old_BR1 old_LT_1 63 old_LT_2_1 old_BR1 old_LT_2 64 old_MPP_4_1 old_BR1 old_MPP_4 65 old_MPP_8_1 old_BR1 old_MPP_8 66 old_MPP_12_1 old_BR1 old_MPP_12 67 old_MPP_0_1 old_BR1 old_MPP_0 68 old_MPP_0.5_1 old_BR1 old_MPP_0.5 69 old_MPP_1_1 old_BR1 old_MPP_1 70 old_MPP_2_1 old_BR1 old_MPP_2 71 old_ST_4_1 old_BR1 old_ST_4 72 old_ST_8_1 old_BR1 old_ST_8 73 old_ST_12_1 old_BR1 old_ST_12 74 old_ST_0_1 old_BR1 old_ST_0 75 old_ST_0.5_1 old_BR1 old_ST_0.5 76 old_ST_1_1 old_BR1 old_ST_1 77 old_ST_2_1 old_BR1 old_ST_2
If I specify all batches and all groups in the design matrix, it is over-specified even without an intercept because I can infer the value of the last group by the value of other groups and batches (e.g. if I specify old_BR1 and old_BR2 batches and the values of the all but one of the old_*_* groups, the left out group is a linear combination of these factors). Accordingly, I specify the design matrix as follows:
G= runData$Group; B= runData$Batch; design <- cbind(model.matrix( ~0+B ),model.matrix( ~0+G )) design = design[,!grepl("Gyoung_MPP_0$",colnames(design)) & !grepl("Gold_MPP_0$",colnames(design))]
Here, I leave out two of the group variables (one from old and one from young), which allows me to fit the GLM:
fit <- glmFit(dge,design)
However, once I have this fit, I am at a loss for how I can compare things to two groups that were left out of the design matrix (e.g. I want a contrast that is:
old_MPP_0vs0.5 = Gold_MPP_0.5 - Gold_MPP_0,
, but Gold_MPP_0 is not in the fitted model).
Alternatively, I could leave out two of the batches (e.g. old_BR1 and young_BR1) to make the system solvable, but then I don't think young and old time points will be comparable since this difference was not modeled.
So how can I contrast variables that are not explicitly modeled in the design matrix?
Thanks!
-Carl
I have partly solved the problem. If I want the following contrast:
but
Gyoung_MPP_0
is not represented in the design matrix, I can just doI still don't know how to compare two variables that are both not represented in the design matrix, but will update if I figure it out.
I'm afraid that this workaround will still give incorrect results. See my answer to your post.