Hi all,

I have the following experimental design, with three biological replicates per `Sample.Type` performed in two batches (two different experimental dates, `Exp.Date`):

```
> metadata
   Sample.ID Sample.Type Exp.Date
1     Ctrl_1        Ctrl        1
2     Ctrl_2        Ctrl        1
3     Ctrl_3        Ctrl        2
4      TxA_1         TxA        1
5      TxA_2         TxA        1
6      TxA_3         TxA        2
7      TxB_1         TxB        1
8      TxB_2         TxB        1
9      TxB_3         TxB        2
10     TxC_1         TxC        2
11     TxC_2         TxC        2
12     TxC_3         TxC        2
```

The aim is to perform differential expression between `Ctrl vs. TxA`, `Ctrl vs. TxB`, and `Ctrl vs. TxC` sample types.

However, when plotting the TMM normalised data using PCA, I noticed PC2 (27% variance) was associated with the batch (shape represents `Exp.Date`, and colour represents `Sample.Type`):

So for differential expression analysis using edgeR, I thought it would be best to use `model.matrix(~Batch + Treatment)` for the model formula, where `Batch` represents `metadata$Exp.Date`, as per the edgeR user guide Section 3.4.3 *"Batch effects"*. However, unlike the examples in the edgeR user guide, I have the situation where a sample type is not present in every batch (i.e. `TxC`).
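To make the proposed model concrete, here is a minimal sketch of how that design matrix could be built from the metadata above, using base R only. It assumes `Treatment` is `metadata$Sample.Type` with `Ctrl` as the reference level (the post does not state this explicitly), and it rebuilds `metadata` by hand from the table shown earlier.

```r
# Rebuild the metadata table from the post (values copied by hand).
metadata <- data.frame(
  Sample.ID   = c("Ctrl_1", "Ctrl_2", "Ctrl_3",
                  "TxA_1",  "TxA_2",  "TxA_3",
                  "TxB_1",  "TxB_2",  "TxB_3",
                  "TxC_1",  "TxC_2",  "TxC_3"),
  Sample.Type = rep(c("Ctrl", "TxA", "TxB", "TxC"), each = 3),
  Exp.Date    = c(1, 1, 2,  1, 1, 2,  1, 1, 2,  2, 2, 2)
)

# Batch comes from Exp.Date; Treatment is assumed to be Sample.Type,
# with Ctrl set as the reference level so each coefficient is "Tx vs Ctrl".
Batch     <- factor(metadata$Exp.Date)
Treatment <- factor(metadata$Sample.Type,
                    levels = c("Ctrl", "TxA", "TxB", "TxC"))

# Additive model: one batch term plus one term per non-reference treatment.
design <- model.matrix(~ Batch + Treatment)
rownames(design) <- metadata$Sample.ID

colnames(design)
design
```

Under this parameterisation, the columns `TreatmentTxA`, `TreatmentTxB` and `TreatmentTxC` would each represent a treatment vs. `Ctrl` comparison adjusted for batch, so in an edgeR workflow they would be the coefficients passed as `coef` to a test such as `glmQLFTest()`.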

My question is, given group `TxC` is only present in batch 2 (and has no samples in batch 1), is this the correct way to deal with the batch effect, given I will be testing `TxC vs. Ctrl`? Or do I need to analyse `TxC` samples separately, i.e. compare `TxC` (n = 3) vs. `Ctrl_3` (n = 1), given this `Ctrl` sample was processed in the same batch as the `TxC` samples?

My understanding was that if you want to model the batch effect, every sample type must be represented in every batch, but I'm unsure whether that's right.
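One way to probe this concern is to cross-tabulate treatment against batch and check whether the additive design matrix is of full column rank. The sketch below does that in base R, with the factor values typed in from the metadata table and `Treatment` assumed to correspond to `Sample.Type`. Full rank only shows that the batch and treatment coefficients can be estimated separately; it does not by itself say how precise the `TxC vs. Ctrl` estimate will be.

```r
# Factor values copied from the metadata table in the post.
Treatment <- factor(rep(c("Ctrl", "TxA", "TxB", "TxC"), each = 3),
                    levels = c("Ctrl", "TxA", "TxB", "TxC"))
Batch     <- factor(c(1, 1, 2,  1, 1, 2,  1, 1, 2,  2, 2, 2))

# Cross-tabulation: the TxC row has no samples in batch 1.
tab <- table(Treatment, Batch)
tab

# Additive design, as proposed in the post.
design <- model.matrix(~ Batch + Treatment)

# If the rank equals the number of columns, no coefficient is a linear
# combination of the others, i.e. batch and treatment are not confounded.
full_rank <- qr(design)$rank == ncol(design)
full_rank
```

Here the batch column is not constant within `Ctrl`, `TxA` or `TxB`, so it cannot be written as a combination of the treatment columns, which is why the rank check passes even though `TxC` appears in only one batch.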

Your help and advice would be greatly appreciated!

Many thanks, Rebecca

OK, good to know. Thanks so much for your help Gordon, I really appreciate it.