Hi
i have a set of RNAseq data with unbalanced batch effect (see table below). Batch 1 was made of single indexed kit and sequenced at time 1, while batch 2 was made of dual indexed kit and sequenced independently from batch 1.
sample |
batch |
groups |
|
1 |
Naive_Dmt3aKO_rep1 |
2 |
Naive_Dmt3aKO |
2 |
Naive_Dmt3aKO_rep2 |
2 |
Naive_Dmt3aKO |
3 |
Naive_Dmt3aKO_rep3 |
2 |
Naive_Dmt3aKO |
4 |
Naive_WT_rep1 |
2 |
Naive_WT |
5 |
Naive_WT_rep2 |
2 |
Naive_WT |
6 |
Naive_WT_rep3 |
2 |
Naive_WT |
7 |
Th17_Dmt3aKO_rep1 |
2 |
Th17_Dmt3aKO |
8 |
Th17_Dmt3aKO_rep2 |
2 |
Th17_Dmt3aKO |
9 |
Th17_Dmt3aKO_rep3 |
1 |
Th17_Dmt3aKO |
10 |
Th17_WT_rep1 |
2 |
Th17_WT |
11 |
Th17_WT_rep2 |
2 |
Th17_WT |
12 |
Th17_WT_rep3 |
1 |
Th17_WT |
13 |
Th1_Dmt3aKO_rep1 |
2 |
Th1_Dmt3aKO |
14 |
Th1_Dmt3aKO_rep2 |
2 |
Th1_Dmt3aKO |
15 |
Th1_Dmt3aKO_rep3 |
1 |
Th1_Dmt3aKO |
16 |
Th1_WT_rep1 |
2 |
Th1_WT |
17 |
Th1_WT_rep2 |
2 |
Th1_WT |
18 |
Th1_WT_rep3 |
1 |
Th1_WT |
19 |
Th2_Dmt3aKO_rep1 |
2 |
Th2_Dmt3aKO |
20 |
Th2_Dmt3aKO_rep2 |
2 |
Th2_Dmt3aKO |
21 |
Th2_Dmt3aKO_rep3 |
1 |
Th2_Dmt3aKO |
22 |
Th2_WT_rep1 |
2 |
Th2_WT |
23 |
Th2_WT_rep2 |
2 |
Th2_WT |
24 |
Th2_WT_rep3 |
1 |
Th2_WT |
From my exploratory analysis, I noticed batch 1 samples and batch 2 samples are clustered independently from each other. https://ibb.co/meWjcz
Thus, on my DE analysis design, I used batch as covariant to evaluate the batch effect DE.
groups <- relevel(groups, ref="Naive_KO") batch <- relevel(batch,ref="1") design <- model.matrix(~batch+groups, data=y$samples) y_filtered <- estimateDisp(y_filtered,design) fit <- glmQLFit(y_filtered, design, robust=T)
I found 24439 genes were differentially expressed btw batch 1 and batch 2.
#### batch effect batch_DE <- glmQLFTest(fit, coef=2) FDR <- p.adjust(batch_DE$table$PValue, 'fdr') sum(FDR < 0.05) # 24439
My questions:
1. since Naive has only batch 2 samples no batch 1 samples, Can 24439 batch DE genes be caused by difference between other Th and Naive? In the linear regression, ~batch+groups
, we assume batch and group are independent. theoretically, those 24439 genes should be independent of group difference, but this unbalanced design really bothers me.
2. Will this unbalance design affect differential analysis between groups? for example, comparing Th1_WT to Naive_WT.
To clarify James' comments, note that your design is not fully confounded; it's still possible to perform contrasts between all pairs of groups. However, the DE genes between batches cannot be caused by any of the naive samples. Any inter-batch differences are estimated solely from the
Th*
groups that have samples in both batches.For your second question, the differences in the number of samples between groups or batches will not compromise the validity of the DE analysis with respect to controlling the error rate. Obviously, if you had equal numbers of samples in each group (or if you had no batch effects in the first place), you would get more power, but that can't be helped now.