Hi,
I am struggling to design a model for my differential expression analysis and I was hoping for some advice.
I have two different vaccines and, within each vaccine group, four timepoints from each donor (with some missing values). I also have a batch variable that I need to control for, as the libraries were generated and sequenced in three batches. The timepoints are pre-prime (T1), post-prime (T2), pre-boost (T3) and post-boost (T4), so I am interested in the following comparisons:
- T2 vs. T1 for vaccine A - response to prime dose of vaccine A
- T4 vs. T3 for vaccine A - response to boost dose of vaccine A
- T2 vs. T1 for vaccine B - response to prime dose of vaccine B
- T4 vs. T3 for vaccine B - response to boost dose of vaccine B
- (T2 vs. T1 for vaccine A) compared with (T2 vs. T1 for vaccine B) - how does prime response differ for the two vaccines?
- (T4 vs. T3 for vaccine A) compared with (T4 vs. T3 for vaccine B) - how does boost response differ for the two vaccines?
- (T4 vs. T3 for vaccine A) compared with (T2 vs. T1 for vaccine A) - how does boost response compare to prime response for vaccine A?
- (T4 vs. T3 for vaccine B) compared with (T2 vs. T1 for vaccine B) - how does boost response compare to prime response for vaccine B?
My samples table looks something like this (but I have ~10 donors per vaccine group, and not all donors have every timepoint):
vaccine | timepoint | donor | batch |
---|---|---|---|
A | T1 | D1 | B1 |
A | T2 | D1 | B1 |
A | T3 | D1 | B2 |
A | T4 | D1 | B1 |
A | T1 | D2 | B2 |
A | T2 | D2 | B1 |
A | T3 | D2 | B1 |
A | T4 | D2 | B1 |
A | T1 | D3 | B2 |
A | T2 | D3 | B1 |
A | T3 | D3 | B3 |
A | T4 | D3 | B1 |
B | T1 | D4 | B2 |
B | T2 | D4 | B1 |
B | T3 | D4 | B3 |
B | T4 | D4 | B3 |
B | T1 | D5 | B1 |
B | T2 | D5 | B1 |
B | T3 | D5 | B2 |
B | T4 | D5 | B3 |
B | T1 | D6 | B1 |
B | T2 | D6 | B2 |
B | T3 | D6 | B2 |
B | T4 | D6 | B3 |
One option would be to split the samples into prime (T1, T2) and boost (T3, T4) timepoints and set up the model in a similar way to the "Comparisons both between and within subjects" example in the edgeR manual. However, this would then not allow me to compare the difference between prime and boost within a vaccine.
Any thoughts would be most appreciated.
Best wishes,
Lucy
Thank you. I have a design matrix with the following columns: donor1, donor2, donor3, ..., batch2, batch3, vaccineA.T2, vaccineB.T2, vaccineA.T3, vaccineB.T3, vaccineA.T4, vaccineB.T4
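A minimal sketch of how a design matrix with these columns could be assembled in base R, using only the six donors from the example table. This is an illustration under assumptions, not necessarily the code that was actually used: the object names (`samples`, `additive`, `vt_cols`, `design`) are made up here, and R will name the columns `donorD1`, `batchB2` and so on rather than `donor1`, `batch2`.

```r
# Sample sheet copied from the table in the question (only the 6 donors shown).
samples <- data.frame(
  vaccine   = rep(c("A", "B"), each = 12),
  timepoint = rep(paste0("T", 1:4), times = 6),
  donor     = rep(paste0("D", 1:6), each = 4),
  batch     = c("B1", "B1", "B2", "B1",  "B2", "B1", "B1", "B1",
                "B2", "B1", "B3", "B1",  "B2", "B1", "B3", "B3",
                "B1", "B1", "B2", "B3",  "B1", "B2", "B2", "B3"),
  stringsAsFactors = FALSE
)

# Additive part: one column per donor (no intercept) plus two batch columns.
additive <- model.matrix(~ 0 + donor + batch, data = samples)

# Vaccine-specific timepoint indicators, in the column order quoted above.
# T1 gets no column: it is the per-donor baseline carried by the donor columns.
vt <- expand.grid(vaccine = c("A", "B"), timepoint = c("T2", "T3", "T4"),
                  stringsAsFactors = FALSE)
vt_cols <- sapply(seq_len(nrow(vt)), function(i) {
  as.numeric(samples$vaccine == vt$vaccine[i] &
             samples$timepoint == vt$timepoint[i])
})
colnames(vt_cols) <- paste0("vaccine", vt$vaccine, ".", vt$timepoint)

design <- cbind(additive, vt_cols)
print(colnames(design))

# All coefficients are estimable only if the matrix has full column rank.
stopifnot(qr(design)$rank == ncol(design))
```

Building the vaccine-by-timepoint columns by hand keeps the T1 columns out from the start, so nothing has to be dropped afterwards to make the matrix full rank.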
The main comparisons I am unsure how to make are:
Does the following make sense:
Then e.g. for boost vs. prime, vaccine A:
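As an illustration only, here is one way the eight comparisons listed at the top of the thread might be written as numeric contrast vectors over the column names quoted above. The helper `make_contrast`, the contrast names, and the use of six donor columns are assumptions made for this sketch. Because T1 is the baseline, a prime response is just the T2 coefficient, and a boost response is T4 minus T3.

```r
# Coefficient names as listed in the post (six donors assumed for illustration).
coef_names <- c(paste0("donor", 1:6), "batch2", "batch3",
                "vaccineA.T2", "vaccineB.T2", "vaccineA.T3", "vaccineB.T3",
                "vaccineA.T4", "vaccineB.T4")

# Hypothetical helper: a zero vector over all coefficients with the
# requested weights filled in by name.
make_contrast <- function(weights, coef_names) {
  stopifnot(all(names(weights) %in% coef_names))
  con <- setNames(numeric(length(coef_names)), coef_names)
  con[names(weights)] <- weights
  con
}

contrasts <- cbind(
  A.prime = make_contrast(c(vaccineA.T2 = 1), coef_names),
  A.boost = make_contrast(c(vaccineA.T4 = 1, vaccineA.T3 = -1), coef_names),
  B.prime = make_contrast(c(vaccineB.T2 = 1), coef_names),
  B.boost = make_contrast(c(vaccineB.T4 = 1, vaccineB.T3 = -1), coef_names),
  # Difference in prime response between vaccines: A.T2 - B.T2
  prime.AvsB = make_contrast(c(vaccineA.T2 = 1, vaccineB.T2 = -1), coef_names),
  # Difference in boost response: (A.T4 - A.T3) - (B.T4 - B.T3)
  boost.AvsB = make_contrast(c(vaccineA.T4 = 1, vaccineA.T3 = -1,
                               vaccineB.T4 = -1, vaccineB.T3 = 1), coef_names),
  # Boost vs. prime within a vaccine: (T4 - T3) - T2
  A.boostVsPrime = make_contrast(c(vaccineA.T4 = 1, vaccineA.T3 = -1,
                                   vaccineA.T2 = -1), coef_names),
  B.boostVsPrime = make_contrast(c(vaccineB.T4 = 1, vaccineB.T3 = -1,
                                   vaccineB.T2 = -1), coef_names)
)
print(contrasts[9:14, ])
```

Assuming a fit object from `glmQLFit()`, a single column such as `contrasts[, "A.boostVsPrime"]` could then be passed as the `contrast` argument of `glmQLFTest()`, provided the vector is in the same order as the columns of the design matrix.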
The design matrix columns and contrasts look correct in principle. Why are you unsure?
Keep in mind that I have not seen either your complete data or the commands used to create the design matrix, so I can't comment on whether the batch correction is correct.
Thank you. Regarding batch correction, I detected a batch effect by PCA that could be successfully removed with ComBat, so I'm hoping that edgeR will also be able to deal with the batch effect. However, since the T1 timepoint for each donor/vaccine combination can only be in one batch, I am not sure whether this could cause issues.
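One basic check that can be run before fitting is whether the batch columns are estimable alongside donor and timepoint, i.e. whether the design matrix has full column rank. The sketch below does this for the six donors in the example table only. The names `full`, `design`, `no_batch` and `rank_of` are made up for illustration, and the result for the real data, with more donors and missing timepoints, may differ. Passing the rank check is a necessary condition, not a guarantee that the batch adjustment is adequate.

```r
# Sample sheet copied from the table in the question (only the 6 donors shown).
samples <- data.frame(
  vaccine   = rep(c("A", "B"), each = 12),
  timepoint = rep(paste0("T", 1:4), times = 6),
  donor     = rep(paste0("D", 1:6), each = 4),
  batch     = c("B1", "B1", "B2", "B1",  "B2", "B1", "B1", "B1",
                "B2", "B1", "B3", "B1",  "B2", "B1", "B3", "B3",
                "B1", "B1", "B2", "B3",  "B1", "B2", "B2", "B3"),
  stringsAsFactors = FALSE
)

# How is batch spread over timepoints?
print(table(samples$timepoint, samples$batch))

# Combined vaccine.timepoint factor; A.T1 is dropped automatically as the
# reference level, but B.T1 is kept and is redundant with the donor columns.
samples$group <- factor(paste(samples$vaccine, samples$timepoint, sep = "."))
full <- model.matrix(~ 0 + donor + batch + group, data = samples)

rank_of <- function(m) qr(m)$rank
cat("full:   columns =", ncol(full), " rank =", rank_of(full), "\n")

# Remove the redundant B.T1 column so that T1 is the baseline for both vaccines.
design <- full[, colnames(full) != "groupB.T1"]
cat("design: columns =", ncol(design), " rank =", rank_of(design), "\n")

# Batch is separately estimable if adding its columns raises the rank by
# the number of batch columns.
no_batch <- design[, !grepl("^batch", colnames(design))]
cat("rank gained by batch columns:", rank_of(design) - rank_of(no_batch), "\n")
```

For the six donors shown, batch varies within donors and within timepoints, so the reduced matrix is full rank. If every donor had all of its samples in one batch, the batch columns would be linear combinations of the donor columns and the same check would flag it.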
The design matrix was created as below, following the example in the thread "EdgeR - Model matrix for complex model similar to user guide 3.5 example - but more complex".
How would the coefficients be interpreted in this scenario:
For me, it would be more intuitive to write the boost vs. prime contrasts as below, but this doesn't fit the format of the design matrix. I think this is where my confusion comes in.
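A small numerical sketch may help with the interpretation. It uses made-up, noise-free log-scale values and ordinary least squares (`lm.fit`) rather than edgeR's negative binomial GLM, so it only illustrates the parametrization; all numbers and object names are assumptions. The point is that there is no T1 column because the T1 level is absorbed into the donor coefficients, so each `vaccineX.Tk` coefficient is already "Tk minus T1", and (T4 - T3) - (T2 - T1) reduces to T4 - T3 - T2.

```r
# Sample sheet copied from the table in the question (only the 6 donors shown).
samples <- data.frame(
  vaccine   = rep(c("A", "B"), each = 12),
  timepoint = rep(paste0("T", 1:4), times = 6),
  donor     = rep(paste0("D", 1:6), each = 4),
  batch     = c("B1", "B1", "B2", "B1",  "B2", "B1", "B1", "B1",
                "B2", "B1", "B3", "B1",  "B2", "B1", "B3", "B3",
                "B1", "B1", "B2", "B3",  "B1", "B2", "B2", "B3"),
  stringsAsFactors = FALSE
)

# Design: donor + batch + vaccine-specific T2/T3/T4 indicators (T1 = baseline).
vt <- expand.grid(vaccine = c("A", "B"), timepoint = c("T2", "T3", "T4"),
                  stringsAsFactors = FALSE)
vt_cols <- sapply(seq_len(nrow(vt)), function(i) {
  as.numeric(samples$vaccine == vt$vaccine[i] &
             samples$timepoint == vt$timepoint[i])
})
colnames(vt_cols) <- paste0("vaccine", vt$vaccine, ".", vt$timepoint)
design <- cbind(model.matrix(~ 0 + donor + batch, data = samples), vt_cols)

# Made-up "true" mean log-expression per vaccine and timepoint (T1 is NOT zero).
mu <- matrix(c(5, 7, 6, 9,
               4, 5, 5, 6),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("A", "B"), paste0("T", 1:4)))
donor_eff <- setNames(c(0.3, -0.2, 0.1, 0.5, -0.4, 0.0), paste0("D", 1:6))
batch_eff <- c(B1 = 0, B2 = 0.8, B3 = -0.5)

# Noise-free response for one hypothetical gene.
y <- unname(mu[cbind(samples$vaccine, samples$timepoint)] +
            donor_eff[samples$donor] + batch_eff[samples$batch])

beta <- lm.fit(x = design, y = y)$coefficients

prime_A <- unname(beta["vaccineA.T2"])                        # T2 - T1
boost_A <- unname(beta["vaccineA.T4"] - beta["vaccineA.T3"])  # T4 - T3
cat("prime A:", prime_A, " true:", mu["A", "T2"] - mu["A", "T1"], "\n")
cat("boost A:", boost_A, " true:", mu["A", "T4"] - mu["A", "T3"], "\n")
cat("boost vs prime A:", boost_A - prime_A, "\n")
```

In this toy setup the donor coefficient for D1 comes out as that donor's own offset plus the vaccine A mean at T1, which is where the "missing" T1 term ends up.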
I am keen to update my understanding of GLMs, so I would be happy to be pointed to some useful books or resources.
Gordon Smyth, could you tell me whether my interpretation of the coefficients is correct, or what I am missing?