Hello, I am having a lot of difficulty setting up my design matrix. I have been searching extensively online for an answer but I think I lack some of the background to understand how to do this since my experiment is fairly complicated. I think I have a basic grasp of setting up model matrices in simple experiments but contrast coding is still a complete mystery to me.
This is a RIP-Seq experiment where I pulled down a protein a of interest with an antibody and then extracted the attached RNA which was then sequenced. This was performed in either wild-type (WT) tissue or protein knock-out (KO) where the KO is assumed to represent some kind of noise or background for the experiment. WT or KO tissue is from animals injected with various drugs.
My basic design is as follows:
Condition | Genotype | Replicates |
Drug 1 | WT | 1 |
Drug 1 | WT | 2 |
Drug 1 | WT | 3 |
Drug 2 | WT | 1 |
Drug 2 | WT | 2 |
Drug 2 | WT | 3 |
Control | WT | 1 |
Control | WT | 2 |
Control | WT | 3 |
Drug 1 | KO | 1 |
Drug 1 | KO | 2 |
Drug 1 | KO | 3 |
Drug 2 | KO | 1 |
Control | KO | 1 |
Control | KO | 2 |
Control | KO | 3 |
I have tried setting up my model matrix a few different ways using model.matrix()
but I don't always see all the coefficients I am interested in. The kind of interaction I am interested in would be either Control_WT:Control_KO:Drug2_WT:Drug2_KO or Drug1_WT:Drug1_KO:Drug2_WT:Drug2_KO. In both of these cases, what I am really interested in is the WT conditions fold-change (i.e. Control over Drug2) but I want the KO noise to be modeled in as well. This 4-way interaction seems completely wrong and way too complicated. How is such modeling normally done? Limma refuses to use such columns if I try ("coefficients not estimable").
This brings me to a second question. Say I just had one coefficient ("Drug1_WT" for example). What exactly is limma giving back to me after lmfit()
> eBates()
> topTable()$AveExpr
? This condition isn't being compared to anything is it? Or would it be compared to any other coefficients that come before it if I used an additive model to set up the regression equation? The limma reference seems to suggest that the "AveExpr" column is the fold change compared to all other columns in the regression equation. If that is the case, then [limma documentation does not say this, see Gordon Smyth's answer below] should my coefficients really be Control_WT:Control_KO and then Drug2_WT:Drug2_KO? Then when I perform the fit and use topTable()
to find the Control_WT:Control_KO coefficient, the AveExpr column would reflect the fold change relative to Drug2_WT:Drug2_KO?
Sorry if this is really confusing. Any guidance on how to best set up my model matrix (and/or contrast matrix if it's even needed) would be greatly appreciated.
Yes, I think so! Thank you so much! So after making the contrast matrix, I can do something like:
What I see in the "logFC" column should be the fold change of the coefficients recalculated for the contrasts?
No, what is in the logFC is (literally, again!) the results of computing e.g., (Control_WT - Control_KO) - (Drug2_WT - Drug2_KO). It isn't actually interpretable as is; a logFC of say -2 doesn't really tell you anything more than the fact that the difference between control WT and KO samples is
smallerdifferent than the difference between drug treated WT and KO samples.In other words, you can come up with all sorts of scenarios that will give rise to an interaction coefficient of -2, that will have much different biological interpretations. As an example (and here each number represents the mean expression for a given sample type, using the example interaction contrast above):
All of these interactions will end up with a logFC of -2, but none of them can be interpreted the same way. So computing the interaction term is just the first step. You then have to somehow segregate the genes as to what pattern they follow. You can do that in a univariate sense by using e.g., ReportingTools and generating HTML tables with little plots that show what is happening for each gene, or you could take all the genes that have a significant interaction term and use kmeans clustering to segregate them into sets of genes that have the same expression patterns, or do heatmaps, or whatever.
Wow, okay so your example made me understanding contrasts. It really is a lot simpler than I have been making it out to be (which is why you keep saying "literally"!). Alright, thanks for your suggestions, this has been really helpful. I think I am finally getting the hang of it.