I am following the advice: "If you can't find an answer to your question within one hour - ask someone!"
So here I write, my question regarding model matrices in limma or DESeq2.
I will copy a reprex first:
# Create data frame with information.
groups <- data.frame(group = c(rep(1,6), rep(2,6), rep(3,6), rep(4,6)),
subject = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8),
treatment = c(rep("a",6),rep("b",6),rep("a",6),rep("b",6)),
time = c(rep(c(1,2,3), 4), rep(c(1,4,5), 4)))
# Setting levels for fixed effects.
groups$treatment <- factor(groups$treatment)
groups$time <- factor(groups$time)
groups$group <- factor(groups$group)
# Creating design formula.
design <- model.matrix( ~ 0 + treatment + treatment:time, data = groups)
design
#> treatmenta treatmentb treatmenta:time2 treatmentb:time2
#> 1 1 0 0 0
#> 2 1 0 1 0
#> 3 1 0 0 0
#> 4 1 0 0 0
#> 5 1 0 1 0
#> 6 1 0 0 0
#> 7 0 1 0 0
#> 8 0 1 0 1
#> 9 0 1 0 0
#> 10 0 1 0 0
#> 11 0 1 0 1
#> 12 0 1 0 0
#> 13 1 0 0 0
#> 14 1 0 0 0
#> 15 1 0 0 0
#> 16 1 0 0 0
#> 17 1 0 0 0
#> 18 1 0 0 0
#> 19 0 1 0 0
#> 20 0 1 0 0
#> 21 0 1 0 0
#> 22 0 1 0 0
#> 23 0 1 0 0
#> 24 0 1 0 0
#> treatmenta:time3 treatmentb:time3 treatmenta:time4 treatmentb:time4
#> 1 0 0 0 0
#> 2 0 0 0 0
#> 3 1 0 0 0
#> 4 0 0 0 0
#> 5 0 0 0 0
#> 6 1 0 0 0
#> 7 0 0 0 0
#> 8 0 0 0 0
#> 9 0 1 0 0
#> 10 0 0 0 0
#> 11 0 0 0 0
#> 12 0 1 0 0
#> 13 0 0 0 0
#> 14 0 0 1 0
#> 15 0 0 0 0
#> 16 0 0 0 0
#> 17 0 0 1 0
#> 18 0 0 0 0
#> 19 0 0 0 0
#> 20 0 0 0 1
#> 21 0 0 0 0
#> 22 0 0 0 0
#> 23 0 0 0 1
#> 24 0 0 0 0
#> treatmenta:time5 treatmentb:time5
#> 1 0 0
#> 2 0 0
#> 3 0 0
#> 4 0 0
#> 5 0 0
#> 6 0 0
#> 7 0 0
#> 8 0 0
#> 9 0 0
#> 10 0 0
#> 11 0 0
#> 12 0 0
#> 13 0 0
#> 14 0 0
#> 15 1 0
#> 16 0 0
#> 17 0 0
#> 18 1 0
#> 19 0 0
#> 20 0 0
#> 21 0 1
#> 22 0 0
#> 23 0 0
#> 24 0 1
#> attr(,"assign")
#> [1] 1 1 2 2 2 2 2 2 2 2
#> attr(,"contrasts")
#> attr(,"contrasts")$treatment
#> [1] "contr.treatment"
#>
#> attr(,"contrasts")$time
#> [1] "contr.treatment"
Created on 2019-07-03 by the reprex package (v0.3.0)
As one can see, the dataframe created contains four columns: Group, Subject, Treatment, and Time. We have repeated measures for the subjects at different time points, and we have four groups, each group receiving a treatment at a specific time point. There is no control group in a "classical" sense. We are basically interested in differences between the groups, and this can mean differences occurring as a consequence of the treatment or differences with the same treatment but occurring as a consequence of different intervention time points.
I am bit thrown off right now because of the missing control I guess.
Modeling this experimental setup using an intercept (~ treatment + treatment:time
) shows, when I create a model matrix, an intercept representing the mean of all samples (each row in column "intercept" in the model matrix contains "1"). Each coefficient will be based on this mean (difference to this mean). This is of course not what I want. Different groups will represent the baseline within a specific comparison/contrast.
So I created the design formula with ~ 0 + treament + treatment:time
. The design matrix is shown in my reprex code. If I want to look for the comparison of group 4 between time point 1 compared to time point 5, how does my contrast look like?
If I use the contrast "group4:time5" in the reprex, what is the comparison here?
Thanks for everybody's help. I appreciate it. Especially since this might not be a new question and maybe someone already wrote a similar question which was answered in the past.
PS. I am not worried about the subject information at this point. My plan is to use limma and provide subjects as a blocking factor, or use the dream function with subject as a random effect.
Thanks for your swift response!
I was assuming that I do make some mistakes in interpreting the model matrix. Your clarification helps!
Sorry for the group4:time5 mixup... that was my mistake and you are right, that does not exist in the model matrix.
But at the end this is (among other contrasts) what I want: A contrast with only group 4 samples between time point 1 and 5. Group 4 belongs to treatment b, but so is also group 2 (see the table created at the beginning of my example). But since group 2 has a different schedule (and my real data is even more complex), there is a case where I really just want group 4 time point 1 vs. 5. I do assume right now, this is not possible with the current matrix. But I am also not sure how to construct the model matrix differently. Including "group" as a fixed factor too?
The usual advice in this situation is to forgo using the conventional R model building, and instead re-formulate your data so you can compute the mean of each group. So something like
will give you a vector of individual groups that you can use to generate the model matrix (for which you should use ~0 so you compute the mean of each group, and can make comparisons using contrasts), and you can then make any comparison you would like. However, you also need to account for the fact that you have pairing, so you need to reformulate the subjects so you can have a full rank design matrix. There is an example that may be applicable in section 3.5 of the edgeR User's Guide. If this seems mysterious to you, I recommend you find somebody local for whom it is not. It's pretty easy to botch the job if you don't really know what you are doing.
reprex is a convenient way to share code that is self sufficient aka reproducible. https://cran.r-project.org/web/packages/reprex/index.html Best.
Wow. It's come to that? The mind, it boggles.