Hi all,
I have the following experimental design, with three biological replicates per Sample.Type, performed in two batches (two different experimental dates, Exp.Date):
> metadata
   Sample.ID Sample.Type Exp.Date
1     Ctrl_1        Ctrl        1
2     Ctrl_2        Ctrl        1
3     Ctrl_3        Ctrl        2
4      TxA_1         TxA        1
5      TxA_2         TxA        1
6      TxA_3         TxA        2
7      TxB_1         TxB        1
8      TxB_2         TxB        1
9      TxB_3         TxB        2
10     TxC_1         TxC        2
11     TxC_2         TxC        2
12     TxC_3         TxC        2
The aim is to test for differential expression in three comparisons: Ctrl vs. TxA, Ctrl vs. TxB, and Ctrl vs. TxC.
However, when plotting the TMM-normalised data using PCA, I noticed that PC2 (27% of the variance) was associated with the batch (shape represents Exp.Date, and colour represents Sample.Type):
So for differential expression analysis using edgeR, I thought it would be best to use model.matrix(~Batch + Treatment) for the model formula, where Batch represents metadata$Exp.Date, as per the edgeR user guide Section 3.4.3 "Batch effects". However, unlike the examples in the edgeR user guide, I have the situation where a sample type is not present in every batch (i.e. TxC).
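For reference, here is a minimal sketch of how the design matrix described above could be built from the metadata shown. It uses base R only. Treating Treatment as metadata$Sample.Type and putting Ctrl first as the reference level are assumptions on my part, not something stated in the user guide example.

```r
# Rebuild the metadata table from the post (values copied from the listing above)
metadata <- data.frame(
  Sample.ID   = c("Ctrl_1", "Ctrl_2", "Ctrl_3", "TxA_1", "TxA_2", "TxA_3",
                  "TxB_1", "TxB_2", "TxB_3", "TxC_1", "TxC_2", "TxC_3"),
  Sample.Type = rep(c("Ctrl", "TxA", "TxB", "TxC"), each = 3),
  Exp.Date    = c(1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2)
)

# Batch comes from Exp.Date; Treatment is ASSUMED to be Sample.Type,
# with Ctrl as the reference level so each coefficient is "Tx vs. Ctrl"
Batch     <- factor(metadata$Exp.Date)
Treatment <- factor(metadata$Sample.Type, levels = c("Ctrl", "TxA", "TxB", "TxC"))

# Additive model: one batch coefficient plus one coefficient per treatment
design <- model.matrix(~Batch + Treatment)
rownames(design) <- metadata$Sample.ID

print(colnames(design))
```

With this parameterisation, the columns TreatmentTxA, TreatmentTxB and TreatmentTxC would correspond to the three comparisons against Ctrl, each adjusted for the batch term.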
My question is: given that group TxC is only present in batch 2 (and has no samples in batch 1), is this the correct way to deal with the batch effect, given that I will be testing TxC vs. Ctrl? Or do I need to analyse the TxC samples separately, i.e. compare TxC (n = 3) vs. Ctrl_3 (n = 1), given that this Ctrl sample was processed in the same batch as the TxC samples?
My understanding was that, to model the batch effect, every sample type must be represented in every batch, but I'm unsure whether that's right.
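One way to probe this concern, as a rough sketch rather than a definitive answer, is to cross-tabulate sample type by batch and check whether the additive design matrix has full column rank. Full column rank only means that every coefficient (including TxC vs. Ctrl) is mathematically estimable; it says nothing about how precise the estimates are. The variable names below are my own, and Ctrl as the reference level is an assumption.

```r
# Same metadata as in the post, rebuilt so this snippet runs on its own
metadata <- data.frame(
  Sample.Type = rep(c("Ctrl", "TxA", "TxB", "TxC"), each = 3),
  Exp.Date    = c(1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2)
)

# How many samples of each type fall in each batch?
# TxC has a zero count in batch 1, which is the situation asked about.
tab <- table(Sample.Type = metadata$Sample.Type, Exp.Date = metadata$Exp.Date)
print(tab)

Batch     <- factor(metadata$Exp.Date)
Treatment <- factor(metadata$Sample.Type, levels = c("Ctrl", "TxA", "TxB", "TxC"))
design    <- model.matrix(~Batch + Treatment)

# If the rank equals the number of columns, no coefficient is confounded
# with a combination of the others.
design_rank  <- qr(design)$rank
is_full_rank <- design_rank == ncol(design)

# Degrees of freedom left over for estimating dispersion
residual_df <- nrow(design) - design_rank

print(is_full_rank)
print(residual_df)
```

For this layout the check should come out as full rank, because Ctrl_3 sits in batch 2 and so separates the batch term from the TxC term. If Ctrl had no sample in batch 2, the Batch2 and TreatmentTxC columns would no longer be separable in this way.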
Your help and advice would be greatly appreciated!
Many thanks, Rebecca
OK, good to know. Thanks so much for your help, Gordon; I really appreciate it.