Question

edgeR batch effect correction for unevenly distributed groups

rebecca.lea.johnston • 0

@rebeccaleajohnston-22750

Australia

Hi all,

I have the following experimental design, with three biological replicates per Sample.Type performed in two batches (two different experimental dates, Exp.Date):

> metadata
   Sample.ID Sample.Type Exp.Date
1     Ctrl_1        Ctrl        1
2     Ctrl_2        Ctrl        1
3     Ctrl_3        Ctrl        2
4      TxA_1         TxA        1
5      TxA_2         TxA        1
6      TxA_3         TxA        2
7      TxB_1         TxB        1
8      TxB_2         TxB        1
9      TxB_3         TxB        2
10     TxC_1         TxC        2
11     TxC_2         TxC        2
12     TxC_3         TxC        2

The aim is to perform differential expression between Ctrl vs. TxA, Ctrl vs. TxB, and Ctrl vs. TxC sample types.

However, when plotting the TMM normalised data using PCA, I noticed PC2 (27% variance) was associated with the batch (shape represents Exp.Date, and colour represents Sample.Type):

[Figure: PCA plot of PC1 vs. PC2]

So for differential expression analysis using edgeR, I thought it would be best to use model.matrix(~Batch + Treatment) for the design matrix, where Batch represents metadata$Exp.Date and Treatment represents metadata$Sample.Type, as per the edgeR user guide Section 3.4.3 "Batch effects". However, unlike the examples in the edgeR user guide, one of my sample types (TxC) is not present in every batch.

My question is, given group TxC is only present in batch 2 (and has no samples in batch 1), is this the correct way to deal with the batch effect, given I will be testing TxC vs. Ctrl? Or do I need to analyse TxC samples separately, i.e. compare TxC (n = 3) vs. Ctrl_3 (n = 1), given this Ctrl sample was processed in the same batch as the TxC samples?

My understanding was that if you want to model the batch effect, every sample type must be represented in every batch, but I'm unsure if that's right.

Your help and advice would be greatly appreciated!

Many thanks, Rebecca

rnaseq edgeR batch effect • 960 views

score 2 · Accepted Answer · 2020-08-04

Gordon Smyth 50k

@gordon-smyth

WEHI, Melbourne, Australia

No, it isn't necessary for each sample type to be represented in every batch, but the lack of balance will result in some loss of statistical power. When you form the Ctrl vs TxC contrast you will effectively be using only the Batch=2 samples, so you will get the power of an n=1 vs n=3 comparison.
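
One way to check that the unbalanced layout is still estimable is to rebuild the design from the metadata in the question and confirm that the design matrix has full column rank. The sketch below uses base R only; the object names `Batch`, `Treatment`, and `is_full_rank` are illustrative and not taken from the original analysis.

```r
# Rebuild the sample layout from the metadata table in the question.
Sample.Type <- rep(c("Ctrl", "TxA", "TxB", "TxC"), each = 3)
Exp.Date    <- c(1, 1, 2,  1, 1, 2,  1, 1, 2,  2, 2, 2)

Batch     <- factor(Exp.Date)
Treatment <- factor(Sample.Type, levels = c("Ctrl", "TxA", "TxB", "TxC"))

# Cross-tabulation makes the imbalance visible: TxC has no samples in batch 1.
print(table(Batch, Treatment))

# Additive model: one batch coefficient plus one coefficient per treatment vs Ctrl.
design <- model.matrix(~ Batch + Treatment)
print(colnames(design))
# "(Intercept)" "Batch2" "TreatmentTxA" "TreatmentTxB" "TreatmentTxC"

# Full column rank means every coefficient, including TreatmentTxC, is estimable.
is_full_rank <- qr(design)$rank == ncol(design)
print(is_full_rank)
```

The matrix is full rank because Ctrl, TxA, and TxB each span both batches, which is enough to separate the batch coefficient from the treatment coefficients.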

You don't need to handle TxC separately; it will be handled automatically by the linear model.
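
As a rough sketch of what fitting all samples in a single model could look like, the code below tests TxC vs. Ctrl from the additive design. Two things here are assumptions rather than part of the thread: the count matrix is simulated placeholder data (the real counts are not shown in the question), and the quasi-likelihood pipeline (`glmQLFit` / `glmQLFTest`) is one of several edgeR workflows that could be used.

```r
library(edgeR)

set.seed(1)

# Sample layout from the question.
Batch     <- factor(c(1, 1, 2,  1, 1, 2,  1, 1, 2,  2, 2, 2))
Treatment <- factor(rep(c("Ctrl", "TxA", "TxB", "TxC"), each = 3),
                    levels = c("Ctrl", "TxA", "TxB", "TxC"))
design <- model.matrix(~ Batch + Treatment)

# Placeholder counts (assumption): 1000 genes x 12 samples with no true effects.
# Replace with the real count matrix.
n_genes <- 1000
counts <- matrix(rnbinom(n_genes * 12, mu = 100, size = 10),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0(Treatment, "_", rep(1:3, times = 4))))

y <- DGEList(counts = counts)
y <- calcNormFactors(y)        # TMM normalisation, as in the question
y <- estimateDisp(y, design)   # dispersions estimated from all 12 samples
fit <- glmQLFit(y, design)

# Because Ctrl is the reference level, each Treatment coefficient is a
# comparison against Ctrl, adjusted for batch.
res_TxC <- glmQLFTest(fit, coef = "TreatmentTxC")
print(topTags(res_TxC))
```

The Ctrl vs. TxA and Ctrl vs. TxB comparisons follow the same pattern with `coef = "TreatmentTxA"` and `coef = "TreatmentTxB"`.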