edgeR batch effect correction for unevenly distributed groups
@rebeccaleajohnston-22750
Australia

Hi all,

I have the following experimental design, with three biological replicates per Sample.Type performed in two batches (two different experimental dates, Exp.Date):

> metadata
Sample.ID Sample.Type Exp.Date
1     Ctrl_1        Ctrl        1
2     Ctrl_2        Ctrl        1
3     Ctrl_3        Ctrl        2
4      TxA_1         TxA        1
5      TxA_2         TxA        1
6      TxA_3         TxA        2
7      TxB_1         TxB        1
8      TxB_2         TxB        1
9      TxB_3         TxB        2
10     TxC_1         TxC        2
11     TxC_2         TxC        2
12     TxC_3         TxC        2


The aim is to perform differential expression between Ctrl vs. TxA, Ctrl vs. TxB, and Ctrl vs. TxC sample types.

However, when plotting the TMM-normalised data using PCA, I noticed that PC2 (27% of variance) was associated with the batch (shape represents Exp.Date, and colour represents Sample.Type):

[Figure: PCA plot, PC1 vs. PC2]

So for differential expression analysis using edgeR, I thought it would be best to use model.matrix(~Batch + Treatment) for the model formula, where Batch represents metadata$Exp.Date and Treatment represents metadata$Sample.Type, as per the edgeR user guide Section 3.4.3 "Batch effects". However, unlike the examples in the edgeR user guide, one of my sample types (TxC) is not present in every batch.
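
For reference, the design described above can be sketched in base R as follows. This is a minimal sketch, not the poster's actual script: the data frame is retyped from the metadata table, the object name `design` is an assumption, and `relevel()` is used only to make Ctrl the explicit reference level.

```r
# Metadata retyped from the table in the post
metadata <- data.frame(
  Sample.ID   = c("Ctrl_1", "Ctrl_2", "Ctrl_3", "TxA_1", "TxA_2", "TxA_3",
                  "TxB_1", "TxB_2", "TxB_3", "TxC_1", "TxC_2", "TxC_3"),
  Sample.Type = rep(c("Ctrl", "TxA", "TxB", "TxC"), each = 3),
  Exp.Date    = c(1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2)
)

# Both variables must be factors; Ctrl is set as the reference level
Batch     <- factor(metadata$Exp.Date)
Treatment <- relevel(factor(metadata$Sample.Type), ref = "Ctrl")

# Additive model: one batch coefficient plus one coefficient per treatment
design <- model.matrix(~ Batch + Treatment)
rownames(design) <- metadata$Sample.ID

# Cross-tabulation makes the missing TxC / batch 1 cell visible
print(table(Treatment, Batch))

# The design is still full column rank, so every coefficient is estimable
print(colnames(design))
print(qr(design)$rank == ncol(design))
```

With this parameterisation the coefficients TreatmentTxA, TreatmentTxB and TreatmentTxC each represent a treatment vs. Ctrl comparison adjusted for batch.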

My question is, given group TxC is only present in batch 2 (and has no samples in batch 1), is this the correct way to deal with the batch effect, given I will be testing TxC vs. Ctrl? Or do I need to analyse TxC samples separately, i.e. compare TxC (n = 3) vs. Ctrl_3 (n = 1), given this Ctrl sample was processed in the same batch as the TxC samples?

My understanding was that if you want to model the batch effect, every sample type must be represented in every batch, but I'm unsure whether that's right.

Many thanks, Rebecca

rnaseq edgeR batch effect • 645 views
@gordon-smyth
WEHI, Melbourne, Australia

No, it isn't necessary for each sample type to be represented in every batch, but the lack of balance will result in some loss of statistical power. When you form the Ctrl vs TxC contrast you will effectively be using only the Batch=2 samples, so you will get the power of an n=1 vs n=3 comparison.
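
One rough way to get a feel for the cost of the imbalance is to look at the unscaled variance of the TxC coefficient in an ordinary least-squares analogue of the design. This is only a sketch under stated assumptions: it uses a plain linear model with equal variances, not the negative binomial GLM that edgeR fits, and the object names are illustrative.

```r
# Design from the post: TxC has samples in batch 2 only
Treatment <- factor(rep(c("Ctrl", "TxA", "TxB", "TxC"), each = 3))
Batch     <- factor(c(1, 1, 2,  1, 1, 2,  1, 1, 2,  2, 2, 2))
design    <- model.matrix(~ Batch + Treatment)

# Unscaled covariance of the coefficients under ordinary least squares;
# multiply by the residual variance to get the actual covariance
unscaled <- solve(crossprod(design))
v_design <- unscaled["TreatmentTxC", "TreatmentTxC"]

# Reference points: two-group comparisons with no batch term
v_3v3 <- 1 / 3 + 1 / 3   # balanced 3 vs 3
v_1v3 <- 1 / 1 + 1 / 3   # 1 vs 3

print(round(c(this_design = v_design,
              balanced_3v3 = v_3v3,
              one_vs_three = v_1v3), 3))
```

A larger multiplier means a less precise estimate. In this least-squares analogue the TxC vs. Ctrl multiplier is larger than for a balanced 3 vs 3 comparison, which is the loss of power caused by the lack of balance, while the 1 vs 3 value serves as a conservative reference point.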

You don't need to handle TxC separately; it will be handled automatically by the linear model.
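
As an illustration of fitting all comparisons in a single model, here is a hedged sketch of an edgeR quasi-likelihood workflow. The count matrix is simulated purely as a placeholder for the real data, the object names are assumptions, and the code only runs if edgeR is installed.

```r
if (requireNamespace("edgeR", quietly = TRUE)) {
  library(edgeR)
  set.seed(1)

  # Design from the post
  Treatment <- factor(rep(c("Ctrl", "TxA", "TxB", "TxC"), each = 3))
  Batch     <- factor(c(1, 1, 2,  1, 1, 2,  1, 1, 2,  2, 2, 2))
  design    <- model.matrix(~ Batch + Treatment)

  # Simulated counts (placeholder only; substitute the real count matrix)
  counts <- matrix(rnbinom(1000 * 12, mu = 50, size = 10), nrow = 1000)
  rownames(counts) <- paste0("gene", seq_len(1000))

  y   <- DGEList(counts = counts)
  y   <- calcNormFactors(y)        # TMM normalisation by default
  y   <- estimateDisp(y, design)
  fit <- glmQLFit(y, design)

  # Each treatment coefficient is a batch-adjusted comparison against Ctrl
  res_TxA <- glmQLFTest(fit, coef = "TreatmentTxA")
  res_TxB <- glmQLFTest(fit, coef = "TreatmentTxB")
  res_TxC <- glmQLFTest(fit, coef = "TreatmentTxC")

  print(topTags(res_TxC))
}
```

All twelve samples contribute to the dispersion estimates in this single fit, which is one reason to keep TxC in the same model rather than analysing it separately.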


OK, good to know. Thanks so much for your help Gordon, I really appreciate it.