I have an experiment with 30 samples that I aligned using STAR and created a counts file using featureCounts. I'm struggling to properly set up the design matrix as cursory analysis shows a clear donor effect in the MDS plot.
Sample | Donor | Group |
1 | 1 | Baseline |
2 | 1 | IL2 |
3 | 1 | IL2ZA |
4 | 1 | IL2OKT3 |
5 | 1 | IL2OKT3ZA |
6 | 2 | Baseline |
7 | 2 | IL2 |
8 | 2 | IL2ZA |
9 | 2 | IL2OKT3 |
10 | 2 | IL2OKT3ZA |
11 | 3 | Baseline |
12 | 3 | IL2 |
13 | 3 | IL2ZA |
14 | 3 | IL2OKT3 |
15 | 3 | IL2OKT3ZA |
16 | 4 | Baseline |
17 | 4 | IL2 |
18 | 4 | IL2ZA |
19 | 4 | IL2OKT3 |
20 | 4 | IL2OKT3ZA |
21 | 5 | Baseline |
22 | 5 | IL2 |
23 | 5 | IL2ZA |
24 | 5 | IL2OKT3 |
25 | 5 | IL2OKT3ZA |
26 | 6 | Baseline |
27 | 6 | IL2 |
28 | 6 | IL2ZA |
29 | 6 | IL2OKT3 |
30 | 6 | IL2OKT3ZA |
I've set up the following using edgeR, but I'm not sure that this design is properly accounting for the donor effect I see in the MDS plot:
names(read.counts) <- c("Donor1_T0", "Donor1_IL2", "Donor1_IL2ZA", "Donor1_IL2OKT3", "Donor1_IL2OKT3ZA", "Donor2_T0", "Donor2_IL2", "Donor2_IL2ZA", "Donor2_IL2OKT3", "Donor2_IL2OKT3ZA", "Donor3_T0", "Donor3_IL2", "Donor3_IL2ZA", "Donor3_IL2OKT3", "Donor3_IL2OKT3ZA", "Donor4_T0", "Donor4_IL2", "Donor4_IL2ZA", "Donor4_IL2OKT3", "Donor4_IL2OKT3ZA", "Donor5_T0", "Donor5_IL2", "Donor5_IL2ZA", "Donor5_IL2OKT3", "Donor5_IL2OKT3ZA", "Donor6_T0", "Donor6_IL2", "Donor6_IL2ZA", "Donor6_IL2OKT3", "Donor6_IL2OKT3ZA") sample.type <- factor(c ("T0", "IL2", "IL2ZA", "IL2OKT3", "IL2OKT3ZA","T0", "IL2", "IL2ZA", "IL2OKT3", "IL2OKT3ZA","T0", "IL2", "IL2ZA", "IL2OKT3", "IL2OKT3ZA","T0", "IL2", "IL2ZA", "IL2OKT3", "IL2OKT3ZA","T0", "IL2", "IL2ZA", "IL2OKT3", "IL2OKT3ZA", "T0", "IL2", "IL2ZA", "IL2OKT3", "IL2OKT3ZA")) sample.type <- relevel(sample.type, "T0") edgeR.DGElist <- DGEList(counts = read.counts , group = sample.type) keep <- rowSums( cpm(edgeR.DGElist) > 1) >= 6 edgeR.DGElist <- edgeR.DGElist[keep ,] edgeR.DGElist$samples$lib.size <- colSums(edgeR.DGElist$counts) edgeR.DGElist <- calcNormFactors(edgeR.DGElist , method = "TMM") design <- model.matrix(~0+group, data=edgeR.DGElist$samples) colnames(design) <- levels(edgeR.DGElist$samples$group)
Does this look correct?
If I create,
donor <- factor(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6))
And modify the design matrix,
I don't see donor1 as a group in the matrix. Am I missing something?
You have it correct. Donor 1 is implicitly included in the design because the donor coefficients represent the difference between Donor 1 and the other donors.
OK, thanks.
After I estimate dispersions and test for DE:
edgeR.DGElist <- estimateDisp(edgeR.DGElist, design, robust=TRUE)
fit <- glmQLFit(edgeR.DGElist, design, robust=TRUE)
my.contrasts <- makeContrasts(IL2vsT0=groupIL2-groupT0, IL2OKT3vsT0=groupIL2OKT3-groupT0, IL2OKT3ZAvsT0=groupIL2OKT3ZA-groupT0, IL2ZAvsT0=groupIL2ZA-groupT0, IL2OKT3vsIL2=groupIL2OKT3-groupIL2, IL2ZAvsIL2=groupIL2ZA-groupIL2, IL2OKT3ZAvsIL2OKT3=groupIL2OKT3ZA-groupIL2OKT3, IL2OKT3ZAvsIL2ZA=groupIL2OKT3ZA-groupIL2ZA, levels=design)
res.IL2vsT0 <- glmQLFTest(fit, contrast=my.contrasts[,"IL2vsT0"]) #IL2 minus baseline
topTags(res.IL2vsT0)
Do I need to also include a contrast argument between donors?
That's a question to ask yourself. Do you care about differential expression between donors? If so, then yes, you should should assess differential expression between donors.
Right, I don't, I just want to make sure the donor effect I was seeing in the MDS plot is accounted for in the model:
If I remove the 0+ from the design, will all other columns be relative to the first column ("groupT0")? This is the baseline control, and my thought is that this is the best design since I only want to compare treatment groups and remove the baseline from all samples.
Yes, if you don't include the "0+", that is what you will get: an intercept column representing the expression level in the baseline group, and then a column for each other group representing the difference between that group and the baseline. This is an equally valid approach, and should yield identical p-values to the contrast-based approach that you described above.
OK. Last question and I promise to quit taking up your time :)
In the (~group + donor) model, the columns are
[1] 0 1 1 1 1 2 2 2 2 2.
If I want to compare treatments only, then for example, comparing IL2OKT3 vs IL2 looks like:fit <- glmQLFit(edgeR.DGElist, contrast=c(0,1,-1,0,0,0,0,0,0,0))
Correct?
Without seeing the actual columns of the design matrix, I can't tell you if that contrast is correct. Just use the makeContrasts function to create your contrasts, and you don't have to worry about it.