Multi-way interaction design in limma
2
0
Entering edit mode
snamjoshi87 ▴ 40
@snamjoshi87-11184
Last seen 7.1 years ago

Hello, I am having a lot of difficulty setting up my design matrix. I have been searching extensively online for an answer but I think I lack some of the background to understand how to do this since my experiment is fairly complicated. I think I have a basic grasp of setting up model matrices in simple experiments but contrast coding is still a complete mystery to me.

This is a RIP-Seq experiment where I pulled down a protein a of interest with an antibody and then extracted the attached RNA which was then sequenced. This was performed in either wild-type (WT) tissue or protein knock-out (KO) where the KO is assumed to represent some kind of noise or background for the experiment. WT or KO tissue is from animals injected with various drugs.

My basic design is as follows:

Condition Genotype Replicates
Drug 1 WT 1
Drug 1 WT 2
Drug 1 WT 3
Drug 2 WT 1
Drug 2 WT 2
Drug 2 WT 3
Control WT 1
Control WT 2
Control WT 3
Drug 1 KO 1
Drug 1 KO 2
Drug 1 KO 3
Drug 2 KO 1
Control KO 1
Control KO 2
Control KO 3

I have tried setting up my model matrix a few different ways using model.matrix() but I don't always see all the coefficients I am interested in. The kind of interaction I am interested in would be either Control_WT:Control_KO:Drug2_WT:Drug2_KO or Drug1_WT:Drug1_KO:Drug2_WT:Drug2_KO. In both of these cases, what I am really interested in is the WT conditions fold-change (i.e. Control over Drug2) but I want the KO noise to be modeled in as well. This 4-way interaction seems completely wrong and way too complicated. How is such modeling normally done? Limma refuses to use such columns if I try ("coefficients not estimable").

This brings me to a second question. Say I just had one coefficient ("Drug1_WT" for example). What exactly is limma giving back to me after lmfit() > eBates() > topTable()$AveExpr? This condition isn't being compared to anything is it? Or would it be compared to any other coefficients that come before it if I used an additive model to set up the regression equation? The limma reference seems to suggest that the "AveExpr" column is the fold change compared to all other columns in the regression equation. If that is the case, then [limma documentation does not say this, see Gordon Smyth's answer below] should my coefficients really be Control_WT:Control_KO and then Drug2_WT:Drug2_KO? Then when I perform the fit and use topTable() to find the Control_WT:Control_KO coefficient, the AveExpr column would reflect the fold change relative to Drug2_WT:Drug2_KO?

Sorry if this is really confusing. Any guidance on how to best set up my model matrix (and/or contrast matrix if it's even needed) would be greatly appreciated.

limma design matrix • 1.5k views
ADD COMMENT
4
Entering edit mode
@james-w-macdonald-5106
Last seen 16 hours ago
United States

The best way to proceed in a situation like this is to just compute the mean expression for all groups that you care about and then do the contrasts by hand. This is actually pretty simple - it's just algebra - so don't over think things.

You care about groups like Drug2_KO, which is the set of KO animals that got drug 2, right? So you can set up your model matrix like

grps <- paste(df$Condition, df$Genotype, sep = "_")

## where df is the name of your data.frame containing the group information

design <- model.matrix(~ 0 + grps)

## this will now compute the mean of each group

colnames(design) <- gsub("grps", "", colnames(design))

## get rid of extraneous cruft in the column names

contrast <- makeContrast(interaction1 = (Control_WT - Control_KO) - (Drug2_WT - Drug2_KO), levels = design)

This contrast will now calculate (literally!) (Control_WT - Control_KO) - (Drug2_WT - Drug2_KO), which is the interaction between KO and WT according to drug treatment. Or conversely, the interaction between drug and control according to genotype. Or in English, it finds genes that react to the drug2 treatment differently, depending on if the animals are WT or KO.

You can add any number of different interactions, to find genes that react to the different drugs depending on the genotype status by emulating what I have already done.

Does that help?

ADD COMMENT
0
Entering edit mode

Yes, I think so! Thank you so much! So after making the contrast matrix, I can do something like:

fit <- lmFit(logCPM, design)
fitCont <- contrasts.fit(fit, contrast)
fitCont <- eBayes(fitCont)
results <- topTable(fitCont)

What I see in the "logFC" column should be the fold change of the coefficients recalculated for the contrasts?

ADD REPLY
1
Entering edit mode

No, what is in the logFC is (literally, again!) the results of computing e.g., (Control_WT - Control_KO) - (Drug2_WT - Drug2_KO). It isn't actually interpretable as is; a logFC of say -2 doesn't really tell you anything more than the fact that the difference between control WT and KO samples is smaller different than the difference between drug treated WT and KO samples.

In other words, you can come up with all sorts of scenarios that will give rise to an interaction coefficient of -2, that will have much different biological interpretations. As an example (and here each number represents the mean expression for a given sample type, using the example interaction contrast above):

no change for control, decrease for drug2 treatment in KO

(5 - 5) - (7 - 5)

increase in KO, but mediated when treated with drug2

(5 - 9) - (5 - 7)

decrease in KO for controls, but larger decrease in KO when drug2 treated

(7 - 5) - (9 - 5)

increase in KO for controls, no change when drug treated

(5 - 7) - (5 - 5)

All of these interactions will end up with a logFC of -2, but none of them can be interpreted the same way. So computing the interaction term is just the first step. You then have to somehow segregate the genes as to what pattern they follow. You can do that in a univariate sense by using e.g., ReportingTools and generating HTML tables with little plots that show what is happening for each gene, or you could take all the genes that have a significant interaction term and use kmeans clustering to segregate them into sets of genes that have the same expression patterns, or do heatmaps, or whatever.

ADD REPLY
0
Entering edit mode

Wow, okay so your example made me understanding contrasts. It really is a lot simpler than I have been making it out to be (which is why you keep saying "literally"!). Alright, thanks for your suggestions, this has been really helpful. I think I am finally getting the hang of it.
 

ADD REPLY
1
Entering edit mode
@gordon-smyth
Last seen 13 minutes ago
WEHI, Melbourne, Australia

Your experimental design is a pretty standard one and is dealt with in Section 9.5 of the limma User's Guide. James has recommended that you follow the approach described in Section 9.5.2.

I am puzzled by your remarks about AveExpr. The limma User's Guide (page 60) says that

"The AveExpr column gives the average log2-expression level for that gene across all the arrays and channels in the experiment."

help("topTable") says the same thing:

"AveExpr: average log2-expression for the probe over all arrays and channels"

Is that not clear? I can't see how either of these statements could possibly suggest a fold change compared to anything, let alone to "all the other columns" as you've interpreted it as.

ADD COMMENT
0
Entering edit mode

No, it's clear. I am working with multiple software at the same time and I think I mistakenly read something from a different documentation. Sorry about that. I will modify my question so I don't confuse anyone else who reads it.

ADD REPLY
1
Entering edit mode

OK, thanks. If there is misleading documentation somewhere, let us know so we can fix it.

ADD REPLY

Login before adding your answer.

Traffic: 859 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6