I have an experiment with 30 samples that I aligned using STAR and created a counts file using featureCounts. I'm struggling to properly set up the design matrix as cursory analysis shows a clear donor effect in the MDS plot.
| Sample | Donor | Group |
| 1 | 1 | Baseline |
| 2 | 1 | IL2 |
| 3 | 1 | IL2ZA |
| 4 | 1 | IL2OKT3 |
| 5 | 1 | IL2OKT3ZA |
| 6 | 2 | Baseline |
| 7 | 2 | IL2 |
| 8 | 2 | IL2ZA |
| 9 | 2 | IL2OKT3 |
| 10 | 2 | IL2OKT3ZA |
| 11 | 3 | Baseline |
| 12 | 3 | IL2 |
| 13 | 3 | IL2ZA |
| 14 | 3 | IL2OKT3 |
| 15 | 3 | IL2OKT3ZA |
| 16 | 4 | Baseline |
| 17 | 4 | IL2 |
| 18 | 4 | IL2ZA |
| 19 | 4 | IL2OKT3 |
| 20 | 4 | IL2OKT3ZA |
| 21 | 5 | Baseline |
| 22 | 5 | IL2 |
| 23 | 5 | IL2ZA |
| 24 | 5 | IL2OKT3 |
| 25 | 5 | IL2OKT3ZA |
| 26 | 6 | Baseline |
| 27 | 6 | IL2 |
| 28 | 6 | IL2ZA |
| 29 | 6 | IL2OKT3 |
| 30 | 6 | IL2OKT3ZA |
I've set up the following using edgeR, but I'm not sure that this design is properly accounting for the donor effect I see in the MDS plot:
names(read.counts) <- c("Donor1_T0", "Donor1_IL2", "Donor1_IL2ZA", "Donor1_IL2OKT3", "Donor1_IL2OKT3ZA", "Donor2_T0", "Donor2_IL2", "Donor2_IL2ZA", "Donor2_IL2OKT3", "Donor2_IL2OKT3ZA", "Donor3_T0", "Donor3_IL2", "Donor3_IL2ZA", "Donor3_IL2OKT3", "Donor3_IL2OKT3ZA", "Donor4_T0", "Donor4_IL2", "Donor4_IL2ZA", "Donor4_IL2OKT3", "Donor4_IL2OKT3ZA", "Donor5_T0", "Donor5_IL2", "Donor5_IL2ZA", "Donor5_IL2OKT3", "Donor5_IL2OKT3ZA", "Donor6_T0", "Donor6_IL2", "Donor6_IL2ZA", "Donor6_IL2OKT3", "Donor6_IL2OKT3ZA")
sample.type <- factor(c ("T0", "IL2", "IL2ZA", "IL2OKT3", "IL2OKT3ZA","T0", "IL2", "IL2ZA", "IL2OKT3", "IL2OKT3ZA","T0", "IL2", "IL2ZA", "IL2OKT3", "IL2OKT3ZA","T0", "IL2", "IL2ZA", "IL2OKT3", "IL2OKT3ZA","T0", "IL2", "IL2ZA", "IL2OKT3", "IL2OKT3ZA", "T0", "IL2", "IL2ZA", "IL2OKT3", "IL2OKT3ZA"))
sample.type <- relevel(sample.type, "T0")
edgeR.DGElist <- DGEList(counts = read.counts , group = sample.type)
keep <- rowSums( cpm(edgeR.DGElist) > 1) >= 6
edgeR.DGElist <- edgeR.DGElist[keep ,]
edgeR.DGElist$samples$lib.size <- colSums(edgeR.DGElist$counts)
edgeR.DGElist <- calcNormFactors(edgeR.DGElist , method = "TMM")
design <- model.matrix(~0+group, data=edgeR.DGElist$samples)
colnames(design) <- levels(edgeR.DGElist$samples$group)
Does this look correct?

If I create,
donor <- factor(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6))And modify the design matrix,
colnames(design) <- sub("^Group", "", colnames(design))I don't see donor1 as a group in the matrix. Am I missing something?
You have it correct. Donor 1 is implicitly included in the design because the donor coefficients represent the difference between Donor 1 and the other donors.
OK, thanks.
After I estimate dispersions and test for DE:
edgeR.DGElist <- estimateDisp(edgeR.DGElist, design, robust=TRUE)fit <- glmQLFit(edgeR.DGElist, design, robust=TRUE)my.contrasts <- makeContrasts(IL2vsT0=groupIL2-groupT0, IL2OKT3vsT0=groupIL2OKT3-groupT0, IL2OKT3ZAvsT0=groupIL2OKT3ZA-groupT0, IL2ZAvsT0=groupIL2ZA-groupT0, IL2OKT3vsIL2=groupIL2OKT3-groupIL2, IL2ZAvsIL2=groupIL2ZA-groupIL2, IL2OKT3ZAvsIL2OKT3=groupIL2OKT3ZA-groupIL2OKT3, IL2OKT3ZAvsIL2ZA=groupIL2OKT3ZA-groupIL2ZA, levels=design)res.IL2vsT0 <- glmQLFTest(fit, contrast=my.contrasts[,"IL2vsT0"]) #IL2 minus baselinetopTags(res.IL2vsT0)Do I need to also include a contrast argument between donors?
That's a question to ask yourself. Do you care about differential expression between donors? If so, then yes, you should should assess differential expression between donors.
Right, I don't, I just want to make sure the donor effect I was seeing in the MDS plot is accounted for in the model:
If I remove the 0+ from the design, will all other columns be relative to the first column ("groupT0")? This is the baseline control, and my thought is that this is the best design since I only want to compare treatment groups and remove the baseline from all samples.
Yes, if you don't include the "0+", that is what you will get: an intercept column representing the expression level in the baseline group, and then a column for each other group representing the difference between that group and the baseline. This is an equally valid approach, and should yield identical p-values to the contrast-based approach that you described above.
OK. Last question and I promise to quit taking up your time :)
In the (~group + donor) model, the columns are
[1] 0 1 1 1 1 2 2 2 2 2.If I want to compare treatments only, then for example, comparing IL2OKT3 vs IL2 looks like:fit <- glmQLFit(edgeR.DGElist, contrast=c(0,1,-1,0,0,0,0,0,0,0))Correct?
Without seeing the actual columns of the design matrix, I can't tell you if that contrast is correct. Just use the makeContrasts function to create your contrasts, and you don't have to worry about it.