Question: EdgeR: batches and samples over-specify system, how to contrast un-modeled variables?
0
4.2 years ago by
United States
carlgdeboer0 wrote:

Hi All

I am trying to find differentially expressed genes with edgeR, while simultaneously accounting for batch effects.  Unfortunately, the batches completely encompass the samples, so that if I specify the batches and the samples in my edgeR design matrix, the system is over-specified (i.e. there are infinite optimizations of the GLM, I get the "Design matrix not of full rank" error).  My data represent a time course where I have 2 different ages of mice, 3 cell types, and 7 time points (2*3*7=42).  Each of these was done in two biological replicates, so I have 42*2=84 distinct replicates.  I want to consider each mouse a batch, so there are 2*2 (replicates * old/young) different batches.  Actually, a few samples are missing below, but this shouldn't affect anything.

> runData
id     Batch         Group
1     young_LT_0_1 young_BR1    young_LT_0
2     young_LT_0_2 young_BR2    young_LT_0
3   young_LT_0.5_1 young_BR1  young_LT_0.5
4   young_LT_0.5_2 young_BR2  young_LT_0.5
5     young_LT_1_1 young_BR1    young_LT_1
6     young_LT_1_2 young_BR2    young_LT_1
7    young_LT_12_2 young_BR2   young_LT_12
8      old_LT_12_2   old_BR2     old_LT_12
9     young_LT_2_1 young_BR1    young_LT_2
10    young_LT_2_2 young_BR2    young_LT_2
11      old_LT_2_2   old_BR2      old_LT_2
12    young_LT_4_1 young_BR1    young_LT_4
13    young_LT_4_2 young_BR2    young_LT_4
14    young_LT_8_1 young_BR1    young_LT_8
15    young_LT_8_2 young_BR2    young_LT_8
16   young_MPP_0_1 young_BR1   young_MPP_0
17   young_MPP_0_2 young_BR2   young_MPP_0
18     old_MPP_0_2   old_BR2     old_MPP_0
19 young_MPP_0.5_1 young_BR1 young_MPP_0.5
20 young_MPP_0.5_2 young_BR2 young_MPP_0.5
21   old_MPP_0.5_2   old_BR2   old_MPP_0.5
22   young_MPP_1_1 young_BR1   young_MPP_1
23   young_MPP_1_2 young_BR2   young_MPP_1
24     old_MPP_1_2   old_BR2     old_MPP_1
25  young_MPP_12_1 young_BR1  young_MPP_12
26  young_MPP_12_2 young_BR2  young_MPP_12
27    old_MPP_12_2   old_BR2    old_MPP_12
28   young_MPP_2_1 young_BR1   young_MPP_2
29   young_MPP_2_2 young_BR2   young_MPP_2
30     old_MPP_2_2   old_BR2     old_MPP_2
31   young_MPP_4_1 young_BR1   young_MPP_4
32   young_MPP_4_2 young_BR2   young_MPP_4
33     old_MPP_4_2   old_BR2     old_MPP_4
34   young_MPP_8_1 young_BR1   young_MPP_8
35   young_MPP_8_2 young_BR2   young_MPP_8
36     old_MPP_8_2   old_BR2     old_MPP_8
37    young_ST_0_1 young_BR1    young_ST_0
38    young_ST_0_2 young_BR2    young_ST_0
39      old_ST_0_2   old_BR2      old_ST_0
40  young_ST_0.5_1 young_BR1  young_ST_0.5
41  young_ST_0.5_2 young_BR2  young_ST_0.5
42    old_ST_0.5_2   old_BR2    old_ST_0.5
43    young_ST_1_2 young_BR2    young_ST_1
44      old_ST_1_2   old_BR2      old_ST_1
45   young_ST_12_1 young_BR1   young_ST_12
46   young_ST_12_2 young_BR2   young_ST_12
47     old_ST_12_2   old_BR2     old_ST_12
48    young_ST_2_1 young_BR1    young_ST_2
49    young_ST_2_2 young_BR2    young_ST_2
50      old_ST_2_2   old_BR2      old_ST_2
51    young_ST_4_1 young_BR1    young_ST_4
52    young_ST_4_2 young_BR2    young_ST_4
53      old_ST_4_2   old_BR2      old_ST_4
54    young_ST_8_1 young_BR1    young_ST_8
55    young_ST_8_2 young_BR2    young_ST_8
56      old_ST_8_2   old_BR2      old_ST_8
57      old_LT_4_1   old_BR1      old_LT_4
58      old_LT_8_1   old_BR1      old_LT_8
59     old_LT_12_1   old_BR1     old_LT_12
60      old_LT_0_1   old_BR1      old_LT_0
61    old_LT_0.5_1   old_BR1    old_LT_0.5
62      old_LT_1_1   old_BR1      old_LT_1
63      old_LT_2_1   old_BR1      old_LT_2
64     old_MPP_4_1   old_BR1     old_MPP_4
65     old_MPP_8_1   old_BR1     old_MPP_8
66    old_MPP_12_1   old_BR1    old_MPP_12
67     old_MPP_0_1   old_BR1     old_MPP_0
68   old_MPP_0.5_1   old_BR1   old_MPP_0.5
69     old_MPP_1_1   old_BR1     old_MPP_1
70     old_MPP_2_1   old_BR1     old_MPP_2
71      old_ST_4_1   old_BR1      old_ST_4
72      old_ST_8_1   old_BR1      old_ST_8
73     old_ST_12_1   old_BR1     old_ST_12
74      old_ST_0_1   old_BR1      old_ST_0
75    old_ST_0.5_1   old_BR1    old_ST_0.5
76      old_ST_1_1   old_BR1      old_ST_1
77      old_ST_2_1   old_BR1      old_ST_2

If I specify all batches and all groups in the design matrix, it is over-specified even without an intercept because I can infer the value of the last group by the value of other groups and batches (e.g. if I specify old_BR1 and old_BR2 batches and the values of the all but one of the old_*_* groups, the left out group is a linear combination of these factors).  Accordingly, I specify the design matrix as follows:

G= runData$Group; B= runData$Batch;
design <- cbind(model.matrix( ~0+B ),model.matrix( ~0+G ))
design = design[,!grepl("Gyoung_MPP_0$",colnames(design)) & !grepl("Gold_MPP_0$",colnames(design))]


Here, I leave out two of the group variables (one from old and one from young), which allows me to fit the GLM:

fit <- glmFit(dge,design)

However, once I have this fit, I am at a loss for how I can compare things to two groups that were left out of the design matrix (e.g. I want a contrast that is:

old_MPP_0vs0.5 = Gold_MPP_0.5 - Gold_MPP_0,


, but Gold_MPP_0 is not in the fitted model).

Alternatively, I could leave out two of the batches (e.g. old_BR1 and young_BR1) to make the system solvable, but then I don't think young and old time points will be comparable since this difference was not modeled.

So how can I contrast variables that are not explicitly modeled in the design matrix?

Thanks!

-Carl

modified 4.2 years ago by Gordon Smyth38k • written 4.2 years ago by carlgdeboer0

I have partly solved the problem.  If I want the following contrast:

   young_MPP_0.5vs0 = Gyoung_MPP_0 - Gyoung_MPP_0.5,

but Gyoung_MPP_0 is not represented in the design matrix, I can just do

   young_MPP_0.5vs0 = - Gyoung_MPP_0.5,

I still don't know how to compare two variables that are both not represented in the design matrix, but will update if I figure it out.

1

I'm afraid that this workaround will still give incorrect results. See my answer to your post.

Answer: EdgeR: batches and samples over-specify system, how to contrast un-modeled varia
3
4.2 years ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:

You can't treat each mouse as a batch. You have the same age factor (young/old) built into both your Batch variable and into your treatment Group, and it is not possible to estimate the same young/old difference twice.

You actually have what is called a "multilevel experiment" in the edgeR and limma User's Guides. Two of the comparisons you want to make (between cell types and between times) can be made within mice, but the other comparison (young vs old) must be made between mice. Multilevel designs need special treatment, and it would be best to read the User's Guides. Personally, I would analyse your experiment using limma-voom and would handle the mouse factor using duplicateCorrelation().

Edit: Experimental designs like this are discussed in Section 3.5 ("Comparisons between and within subjects") of the edgeR User's Guide and in Section 9.7 ("Multi-level experiments") of the limma User's Guide. limma can handle a random effect for mouse and the limma section shows how to conduct any comparison you want to make between the treatment groups. (To match up the example to your experiment, think Subject=Mouse, Condition=Age and Tissue=CellType.Time.)

Section 3.5 in the edgeR User's Guide shows you how to squeeze as much as possible out of the analysis if you choose to treat mouse as a blocking effect. (To match up the example to your experiment, think Patient=Mouse, Disease=Age and Treatment=CellType.Time.) You can only make comparisons within mice. You can compare cell types or times within mice, and you can do this separately for young and old mice. You can also test whether the time effect is the same in young as it is in old mice (an interaction). Similarly for cell type effects. But you cannot estimate any age main effects: e.g., you cannot estimate the old vs young difference either in a specific cell type or overall.

All these considerations remain the same regardless of how many replicate mice you have.

Hi Gordon

Thanks for the reply and for correcting me on the workaround.  I'm afraid I don't really understand why mice cannot be treated as batches.  Since we have two biological replicates, I would think we could learn one weight for each mouse, which brings all the samples from those mice on the same scale, and then learn old/young specific differences within each cell type and time point.  I agree that most of the "age-specific" signal will be normalized out by the batch correction (which is okay), but shouldn't there still be differences between young and old within each cell type?  In the meantime, I will read up on the two functions you mentioned.

You can estimate the batch effect, but then you can't do your DE comparison. This is because the batch effect will absorb your differences of interest (i.e., the age effect). For example, let's say we have a gene where expression in all old samples is up 2-fold compared to the young samples. Is this due to a batch effect between mice, e.g., when the RNA samples from the old mice are somehow processed or handled differently? Or is this due to a genuine age effect between old and young mice? There's no way to tell if you include both factors in your model.

If you are looking at age-specific effects across all cell types and time points, then yes, you are correct.  However, if it is an age/cell type/time point specific effect then it may still be evident as only effects that are consistent across an age will have been normalized out (or the effect will only have been partly normalized out).  E.g. If the only time point that has a "true" difference between young and old is MPP_0, then the batch normalization is looking for a bias between young and old across all cell types and all time points. Since only MPP_0 is truly different, the normalization factor between the two will (mostly) reflect the other samples that only represent young/old bias (and young/old global differences).

In such cases, it'll still be pretty bad if the batch effect partially absorbs the age effect. Your log-fold change for the DE comparison will smaller than the true value by some unknown amount, which makes it difficult to interpret. Of course, the coefficients for the batch effect may be zero if you have an age effect that is very specific to one cell-type or time point - but in such cases, what's the point of blocking on the batch effect anyway? You're throwing away several residual degrees of freedom for no change in the fitted values of the linear model.

Awesome!

Thanks for pointing me in the right direction.  Much appreciated!