0
2.9 years ago by
snamjoshi8730
snamjoshi8730 wrote:

Hello, I am having a lot of difficulty setting up my design matrix. I have been searching extensively online for an answer but I think I lack some of the background to understand how to do this since my experiment is fairly complicated. I think I have a basic grasp of setting up model matrices in simple experiments but contrast coding is still a complete mystery to me.

This is a RIP-Seq experiment where I pulled down a protein a of interest with an antibody and then extracted the attached RNA which was then sequenced. This was performed in either wild-type (WT) tissue or protein knock-out (KO) where the KO is assumed to represent some kind of noise or background for the experiment. WT or KO tissue is from animals injected with various drugs.

My basic design is as follows:

 Condition Genotype Replicates Drug 1 WT 1 Drug 1 WT 2 Drug 1 WT 3 Drug 2 WT 1 Drug 2 WT 2 Drug 2 WT 3 Control WT 1 Control WT 2 Control WT 3 Drug 1 KO 1 Drug 1 KO 2 Drug 1 KO 3 Drug 2 KO 1 Control KO 1 Control KO 2 Control KO 3

I have tried setting up my model matrix a few different ways using model.matrix() but I don't always see all the coefficients I am interested in. The kind of interaction I am interested in would be either Control_WT:Control_KO:Drug2_WT:Drug2_KO or Drug1_WT:Drug1_KO:Drug2_WT:Drug2_KO. In both of these cases, what I am really interested in is the WT conditions fold-change (i.e. Control over Drug2) but I want the KO noise to be modeled in as well. This 4-way interaction seems completely wrong and way too complicated. How is such modeling normally done? Limma refuses to use such columns if I try ("coefficients not estimable").

This brings me to a second question. Say I just had one coefficient ("Drug1_WT" for example). What exactly is limma giving back to me after lmfit() > eBates() > topTable()$AveExpr? This condition isn't being compared to anything is it? Or would it be compared to any other coefficients that come before it if I used an additive model to set up the regression equation? The limma reference seems to suggest that the "AveExpr" column is the fold change compared to all other columns in the regression equation. If that is the case, then [limma documentation does not say this, see Gordon Smyth's answer below] should my coefficients really be Control_WT:Control_KO and then Drug2_WT:Drug2_KO? Then when I perform the fit and use topTable() to find the Control_WT:Control_KO coefficient, the AveExpr column would reflect the fold change relative to Drug2_WT:Drug2_KO? Sorry if this is really confusing. Any guidance on how to best set up my model matrix (and/or contrast matrix if it's even needed) would be greatly appreciated. limma design matrix • 515 views ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by snamjoshi8730 Answer: Mult-way interaction design in limma 4 2.9 years ago by United States James W. MacDonald51k wrote: The best way to proceed in a situation like this is to just compute the mean expression for all groups that you care about and then do the contrasts by hand. This is actually pretty simple - it's just algebra - so don't over think things. You care about groups like Drug2_KO, which is the set of KO animals that got drug 2, right? So you can set up your model matrix like grps <- paste(df$Condition, df\$Genotype, sep = "_")

## where df is the name of your data.frame containing the group information

design <- model.matrix(~ 0 + grps)

## this will now compute the mean of each group

colnames(design) <- gsub("grps", "", colnames(design))

## get rid of extraneous cruft in the column names

contrast <- makeContrast(interaction1 = (Control_WT - Control_KO) - (Drug2_WT - Drug2_KO), levels = design)

This contrast will now calculate (literally!) (Control_WT - Control_KO) - (Drug2_WT - Drug2_KO), which is the interaction between KO and WT according to drug treatment. Or conversely, the interaction between drug and control according to genotype. Or in English, it finds genes that react to the drug2 treatment differently, depending on if the animals are WT or KO.

You can add any number of different interactions, to find genes that react to the different drugs depending on the genotype status by emulating what I have already done.

Does that help?

Yes, I think so! Thank you so much! So after making the contrast matrix, I can do something like:

fit <- lmFit(logCPM, design)
fitCont <- contrasts.fit(fit, contrast)
fitCont <- eBayes(fitCont)
results <- topTable(fitCont)

What I see in the "logFC" column should be the fold change of the coefficients recalculated for the contrasts?

1

No, what is in the logFC is (literally, again!) the results of computing e.g., (Control_WT - Control_KO) - (Drug2_WT - Drug2_KO). It isn't actually interpretable as is; a logFC of say -2 doesn't really tell you anything more than the fact that the difference between control WT and KO samples is smaller different than the difference between drug treated WT and KO samples.

In other words, you can come up with all sorts of scenarios that will give rise to an interaction coefficient of -2, that will have much different biological interpretations. As an example (and here each number represents the mean expression for a given sample type, using the example interaction contrast above):

no change for control, decrease for drug2 treatment in KO

(5 - 5) - (7 - 5)

increase in KO, but mediated when treated with drug2

(5 - 9) - (5 - 7)

decrease in KO for controls, but larger decrease in KO when drug2 treated

(7 - 5) - (9 - 5)

increase in KO for controls, no change when drug treated

(5 - 7) - (5 - 5)

All of these interactions will end up with a logFC of -2, but none of them can be interpreted the same way. So computing the interaction term is just the first step. You then have to somehow segregate the genes as to what pattern they follow. You can do that in a univariate sense by using e.g., ReportingTools and generating HTML tables with little plots that show what is happening for each gene, or you could take all the genes that have a significant interaction term and use kmeans clustering to segregate them into sets of genes that have the same expression patterns, or do heatmaps, or whatever.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by James W. MacDonald51k

Wow, okay so your example made me understanding contrasts. It really is a lot simpler than I have been making it out to be (which is why you keep saying "literally"!). Alright, thanks for your suggestions, this has been really helpful. I think I am finally getting the hang of it.

1
2.9 years ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:

Your experimental design is a pretty standard one and is dealt with in Section 9.5 of the limma User's Guide. James has recommended that you follow the approach described in Section 9.5.2.

I am puzzled by your remarks about AveExpr. The limma User's Guide (page 60) says that

"The AveExpr column gives the average log2-expression level for that gene across all the arrays and channels in the experiment."

help("topTable") says the same thing:

"AveExpr: average log2-expression for the probe over all arrays and channels"

Is that not clear? I can't see how either of these statements could possibly suggest a fold change compared to anything, let alone to "all the other columns" as you've interpreted it as.

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by Gordon Smyth38k

No, it's clear. I am working with multiple software at the same time and I think I mistakenly read something from a different documentation. Sorry about that. I will modify my question so I don't confuse anyone else who reads it.

1

OK, thanks. If there is misleading documentation somewhere, let us know so we can fix it.