Question: Confounded design, batch effect correction (edgeR)
gravatar for Laura
8 weeks ago by
Laura20 wrote:


I would like to find differentially expressed genes between different groups in RNA-seq data. I have 55 samples that were prepared either in batch 3, batch 4 or batch 6. These samples represent 4 groups (NR, NT, P, F). When I checked the MDS plot, all the samples were separated by group and by batch, so it seems that the batch effect should be corrected. The problem is that one group (group F) has only 4 samples, which were all prepared in batch 3. All these groups are distinct from each other, so none of the groups can be considered as a reference/control.

I would like do the following comparisons:
- NR vs. NT
- NR and NT vs. P
- NR, NT and P vs. F
- P vs. NR, NT and F
- NR vs. P, F and NT
- NT vs. P, F and NT


I am a bit unsure about a couple of things:

1) How to build the design matrix. I don't have a clear reference/control or "baseline", so I guess I should not include the intercept to the design matrix?

2) How group F affects the de-analysis. For example, if I compare group NT to group F, is the model comparing samples which were prepared within the same batch (NT and F samples which were prepared in batch 3), and ignoring NT samples which were prepared in batch 4 or batch 6? Or should I just ignore group F samples?

3) How to make the correct contrasts for the glmQLFTest function. I think the first three comparisons are correct (the code below), but what about the last three comparisons? How can I make those?


Here are the samples and the design matrix:

group <- factor(c(rep("NR", 11), rep("NT", 26), rep("P", 14), rep("F", 4)))
batch <- factor(c(3,3,3,4,4,6,3,3,3,6,6,6,6,6,4,6,3,3,3,3,3,4,4,4,4,6,6,6,6,3,3,6,6,6,6,6,6,3,3,3,3,4,4,4,6,3,3,3,3,6,6,3,3,3,3))
design <- model.matrix(~0 + batch + group)

Example how the columns in the design matrix look like:

batch3 batch4 batch5 groupNR groupNT groupP


Contrasts for the first three comparisons: 
- NR vs. NT --> c(0,0,0,1,-1,0)
- NR and NT vs. P --> c(0,0,0,0.5,0.5,-1)
- NR, NT and P vs. F --> c(0,0,0,0.33,0.33,0.33)


Thank you in advance for all the help.


ADD COMMENTlink modified 8 weeks ago by Aaron Lun21k • written 8 weeks ago by Laura20
gravatar for Aaron Lun
8 weeks ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

It is good to see a question that gives me code to reproduce the design matrix. Pat yourself on the back.

For question 1: your design matrix is fine. It doesn't matter whether you have an intercept or not, the model is mathematically equivalent. It's just the interpretation of the coefficients that needs to change. For simplicity, I would do:

design <- model.matrix(~0 + group + batch)

... instead, which means that the first four coefficients represent the average log-expression of each group. The original design in your post defines batch terms that represent the log-expression of group F in each batch (yes, even though F only has samples in batch 3!) and the remaining terms represent the log-fold change of each group from F.

For question 2: you have an additive model so edgeR will also use information from the other batches to help estimation of the coefficients for the non-F groups. edgeR will also use information from the other batches to estimate the dispersion. So, there is no need to ignore samples or batches - using all of the data is fine.

For question 3: you'll find life a lot easier with the design I've described above. It's a simple case of doing:

con <- makeContrasts(groupNR - groupNT, levels=design)

... or:

con <- makeContrasts(groupF - (groupNR + groupNT + groupP)/3, levels=design)

... and so on.

ADD COMMENTlink modified 8 weeks ago • written 8 weeks ago by Aaron Lun21k

Thanks a lot Aaron!

ADD REPLYlink written 8 weeks ago by Laura20
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 222 users visited in the last hour