Building model matrix to correct for batch effect with biological and technical replicates: Problem with linear correlations
1
0
Entering edit mode
mfaleevs • 0
@99060f43
Last seen 22 months ago
United Kingdom

Hello all, I was hoping someone here could help me out. I recently conducted some MASS SPEC for my samples. Each sample was run thrice through the machine. However, there was a large space of time between the first run and the consequent second and third run (both run at the same time), so I would like to conduct a batch effect. My data set looks something like this:

Sample      |  Biological Rep 1                 |      Biological Rep 2             |
Condition   |C1      |   C2   |  C3    | C4     | c1     |  C2    |  C3    | C4     |
Tech repeat |1 |2 |2 |1 |2 |2 |1 |2 |2 |1 |2 |2 |1 |2 |2 |1 |2 |2 |1 |2 |2 |1 | 2|2 |


The actual dataset looks a bit like this sample (only showing one biological repeat)

Protein  |C1r1  |C1r2  |c1r3  |c2r1  |c2r2  | c2r3 | c3r1 |c3r2   |c3r3 |c4r1  |c4r2  |c4r3 |
----------------------------------------------------------------------------------------------
Protein1 |19    |34    |45    |10    |23    |22    |16    |92    |28    |11    |29    |23    |
Protein2 |12    |24    |23    |11    |24    |23    |15    |21    |65    |19    |21    |26    |


In the tech repeats, 1 was the technical repeat run first, and 2 represents the second and third repeat that were run at the same time.

The model matrix that I have tried to conduct goes like this:

tr1<- as.factor(rep(c(1,2,2),8)) #batch one technical repeat vs 2/3 technical repeat
ms1<- as.factor(c(rep(1,6), rep(2,6), rep(3,6), rep(4,6))) #4 samples, 6 times run
ex1<- as.factor(c(rep(1,3), rep(2,3), rep(3,3), rep(4,3), rep(1,3), rep(2,3), rep(3,3), rep(4,3))) # 2 biological repeat for each sample, each run thrice
design1<- model.matrix(~ex1 + ms1+tr1)
block <- c(1:6, 1:6, 1:6, 1:6)
dupcor = duplicateCorrelation(df, design = design1,  block = block)
fit <- lmFit(df, design1, block = block, correlation = dupcor\$consensus)


The design matrix looks like this:

(Intercept) ex12 ex13 ex14 ms12 ms13 ms14 tr12
1            1    0    0    0    0    0    0    0
2            1    0    0    0    0    0    0    1
3            1    0    0    0    0    0    0    1
4            1    1    0    0    0    0    0    0
5            1    1    0    0    0    0    0    1
6            1    1    0    0    0    0    0    1
7            1    0    1    0    1    0    0    0
8            1    0    1    0    1    0    0    1
9            1    0    1    0    1    0    0    1
10           1    0    0    1    1    0    0    0
11           1    0    0    1    1    0    0    1
12           1    0    0    1    1    0    0    1
13           1    0    0    0    0    1    0    0
14           1    0    0    0    0    1    0    1
15           1    0    0    0    0    1    0    1
16           1    1    0    0    0    1    0    0
17           1    1    0    0    0    1    0    1
18           1    1    0    0    0    1    0    1
19           1    0    1    0    0    0    1    0
20           1    0    1    0    0    0    1    1
21           1    0    1    0    0    0    1    1
22           1    0    0    1    0    0    1    0
23           1    0    0    1    0    0    1    1
24           1    0    0    1    0    0    1    1


However, when I run the code it tells me that I have a linear combination in some of my variables

Note: design matrix not of full rank (1 coef not estimable).


If I remove any variables from the matrix then I run the risk of not accounting for everything. How do I navigate such a problem? Any input would be greatly appreciated! Thank you

MassSpectrometryData BatchEffect r limma • 953 views
0
Entering edit mode
@gordon-smyth
Last seen 54 minutes ago
WEHI, Melbourne, Australia

This is basically a simple experiment with four conditions and two biological replicates. Your design matrix is unnecessarily complicated and I don't actually follow the logical meaning of some of the variables you have created. I can see that tr1 is the batch effect and I am guessing that ex1 represents the condition, but the meaning of ms1 and block is not evident to me.

Although the MASS SPEC was run three times on each sample, technical replicates of this sort cannot be treated as biological replicates. A simple and good approach would be to average the three technical replicates for each condition and biological replicate. Then you would have a simple experiment with 8 sample values representing two biological replicates of 4 conditions. There would be no batch effect and no blocks. The design matrix formula would be just ~Condition. In this approach you would use:

BiolRep <- gl(8,3,24)
dfa <- avearrays(df, ID=BiolRep)
Condition <- gl(4,1,8)
design <- model.matrix(~Condition)
fit <- lmFit(dfa, design)


The alternative would be to keep all the technical replicates separate and use duplicate correlation. In this approach you would use:

Condition <- gl(4,3,24) # same as your ex1
Batch <- factor(tr1)
BiolRep <- gl(8,3,24)
design <- model.matrix(~Condition+Batch)
dupcor = duplicateCorrelation(df, design,  block = BiolRep)


I recommend the first approach with averaging as more robust, but the second approach with duplicateCorrelation will be more powerful and will find more DE proteins.