Question

Design matrix for unbalanced data

0

Entering edit mode

Yogender Gowtham ▴ 20

@yogender-gowtham-6684

Last seen 9.8 years ago

Hello All

Is this the correct design matrix for the following experiment? The experiment has a Normal cell line (P) and a Mutant cell line (X) that were grown with decreasing nutrient levels, A, B, and C. A is represent 100% nutrient for a particular component, B represent 50% , and C represents 0%. The Normal cells (P) could only grow in A and B, while the Mutant cells (X) could be grown in A, B, and C. The objective is of the study was to identify gene expression levels that changed (i.e., gene sensitive to the nutrient) that allowed the mutant to grow without the nutrient.

Design Matrix:

celltype <- factor(samples$ct, levels = c("P", "X"))
trt <- factor(samples$trt, levels= c("A", "B", "C"))
design <- model.matrix(~0+celltype+trt)
rownames(design) <- samples$file
design
colnames(design)
#[1] "celltypeP" "celltypeX" "trtB" "trtC"

Thanks

edgeR RNA-Seq design matrix multifactorial design • 2.8k views

ADD COMMENT • link 9.9 years ago Yogender Gowtham ▴ 20

score 1 · Answer 1 · 2016-02-27

1

Entering edit mode

Aaron Lun ★ 29k

@alun

Last seen 2 hours ago

The city by the bay

The additive design won't account for interactions between celltype and trt, which would seem to be important if you want to identify mutant-specific responses to the nutrient level. I would suggest this:

grouping <- paste0(celltype, ".", trt)
grouping <- factor(grouping)
design <- model.matrix(~0 + grouping)
colnames(design) <- levels(grouping)

To identify relevant genes, you can test for genes that are DE in the mutant with respect to decreasing nutrient concentration (e.g., between X.C and X.B). This will pick up any genes that get turned on in the mutant when the concentration hits zero, and may be involved in mediating cell survival. You can't contrast this to normal cells, because they don't survive at zero, but you can do something equivalent with the concentrations that you do have:

con <- makeContrasts((X.B - X.A) - (P.B - P.A), levels=design)

This will identify genes that respond to decreasing concentrations in a different manner between normal and mutant cells, which may be involved in predisposing the mutants to survival (or the normal cells to death) when the concentration hits zero. Finally, you can do a straight up comparison between the genotype averages across all concentrations:

con <- makeContrasts((X.A + X.B + X.C)/3 - (P.B + P.A)/2, levels=design)

This will identify genes that are constitutively DE between genotypes and may be involved in driving mutant survival.

ADD COMMENT • link 9.9 years ago Aaron Lun ★ 29k

0

Entering edit mode

Hello Aaron

Thank you once again for your answers. I have some questions related to this experimental setup. I would like to pursue an unbalanced 2-factor ANOVA analysis to look at the differential genes involved in the main effects of treatment and cell type along with interaction. I know you mention a simpler one-way layout, but I'm curious to know how it would work with a 2-way layout. Since the data is unbalanced, I'm not entirely sure on how to drop the coefficients to answer my questions (1) interaction of cell type and nutrient level, 2) main effect of nutrient, irrespective of cell type and 3) main effect of cell type, irrespective of nutrient. Let me remind you about my experimental setup: a Normal cell line (P) and a Mutant cell line (X) that were grown with decreasing nutrient levels, A, B, and C. A is represent 100% nutrient for a particular component, B represent 50% , and C represents 0%. The Normal cells (P) could only grow in A and B, while the Mutant cells (X) could be grown in A, B, and C.

celltype <- factor(samples$ct, levels = c("P", "X"))
trt <- factor(samples$trt, levels= c("A", "B", "C")) 
design <- model.matrix(~celltype+trt)
rownames(design) <- samples$file
design
colnames(design) 
#[1] "(Intercept)"  "celltypeX" "trtB"         "trtC"

Could you please help me as to how I could use the above model to understand the main effects?

design <- model.matrix(~celltype*trt)
rownames(design) <- samples$file
design
colnames(design) 
#[1] "(Intercept)"       "celltypeX"      "trtB"              "trtC"              "celltypeX:trtB" "celltypeX:trtC"

The same for this model above, how can I use it to understand the interaction effects? I'm don't understand what this term means: celltypeX:trtC (as cell type P did not grow at nutrient level C)

Thanks

ADD REPLY • link 9.9 years ago Yogender Gowtham ▴ 20

0

Entering edit mode

For your first model, the intercept represents the average log-expression of P cells in treatment A. celltypeX represents the log-fold change of X over P, i.e., the main effect of the cell type. trtB represents the log-fold change of B over A, and trtC represents the log-fold change of C over A, i.e., the main effects of treatment. You can drop these to test for each effect. However, this assumes that there are no significant interaction effects. If there are, you generally can't interpret the main effects in a sensible manner. For example, if a gene is upregulated in the mutant against normal in treatment A but downregulated in mutant against normal for treatment B, what would be the main effect of the mutation? Which direction would it be in? The simplicity of the additive model belies the strength of its underlying assumptions.

Anyway, for your second model, the intercept represents the average log-expression of P cells in treatment A. celltypeX represents the log-fold change of X over P in treatment A only. trtB represents the log-fold change of B over A in P cells only (but trtC is the same as before). celltypeX:trtB represents the interaction between the effect of B treatment in X cells. celltypeX:trtC has no meaning (it's an all-zero column) and should be removed. You can see how annoying this is to interpret, so it's easier (and statistically equivalent) to parametrize it as a one-way layout as I described originally.

ADD REPLY • link 9.9 years ago Aaron Lun ★ 29k

0

Entering edit mode

Hello

I keep coming back with more questions. Could you please help me understand how and why a one-way design layout is statistically equivalent to a 2-factor or a 3-factor design matrix? It would be helpful if you could point out some kind of reference or a paper. How does the layout change when a factor is an ordinal type value?

Our basic understanding is that in traditional statistics, you perform the ANOVA analysis and then perform the post-hoc testing. Will this still be possible from the one-way layout too? In the example above, how can I interpret the main effect of the nutrient?

Thanks

ADD REPLY • link 9.9 years ago Yogender Gowtham ▴ 20

0

Entering edit mode

These general statistics questions are getting beyond the scope of the original post, and indeed, beyond the scope of this forum. All I'll say is this:

If it's a one-way layout, the design matrix should be constructed from only one factor containing a number of levels (i.e., groups). The design matrix can still have multiple coefficients, though.
If your design matrix involves multiple factors added together during its construction, then it's probably not a one-way layout.
I've never known anyone doing post hoc testing on the ANOVA output from limma. If you want specific contrasts, just test them directly.

ADD REPLY • link 9.9 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thanks Aaron.

Well, So my first question had two factors - cell type and nutrient level. But you had suggested the grouping the factors together (which, if I understand correctly, has turned the model into a one-factor layout). Now, just so I understand clearly, are you also suggesting that I use the grouping design to ask specific pairwise questions rather than performing ANOVA analysis before I do the pairwise comparisons?

ADD REPLY • link 9.9 years ago Yogender Gowtham ▴ 20

0

Entering edit mode

Yes. Obviously, you can still do an ANOVA across all groups:

con <- makeContrasts(P.A - P.B, P.A - X.A, 
   P.A - X.B, P.A - X.C, levels=design)

... but it'll be harder to relate the results to your biological question, compared to doing specific pairwise comparisons where you know where the DE is occurring for each gene.

ADD REPLY • link 9.9 years ago Aaron Lun ★ 29k

Gordon Smyth · Answer 2 · 2016-02-27

0

Entering edit mode

Yogender Gowtham ▴ 20

@yogender-gowtham-6684

Last seen 9.8 years ago

Thanks Aaron. I actually had another design where it is same as you have suggested. Please correct me if I am wrong.

design <- model.matrix(~0+group)
rownames(design) <- samples$file
design
colnames(design)
#[1] "groupP_A"  "groupP_B"    "groupX_C"  "groupX_A" "groupX_B"

#For interaction

lrt <- glmLRT(glmfit, contrast = c(-1,1,0,-1,1))

For the second suggestion, could I still use contrast as:

lrt1 <- glmLRT(glmfit, contrast = c(-0.5,-0.5,0.33,0.33,0.33))

Would this give me the same answer too as (X.A+X.B+X.C)/3 - (P.B+P.A)/2 contrast?

Thanks

ADD COMMENT • link updated 9.9 years ago by Gordon Smyth 53k • written 9.9 years ago by Yogender Gowtham ▴ 20

1

Entering edit mode

It should, though I'm a bit perplexed as to how you got (Intercept) as a column name with an intercept-free model. Also, please post your responses via "Add comment", rather than starting a new "Answer".

ADD REPLY • link 9.9 years ago Aaron Lun ★ 29k

0

Entering edit mode

My apologies. I didn't notice the add comment button. And Yes, you're right about the Intercept. It should say groupP_A. Thanks

ADD REPLY • link 9.9 years ago Yogender Gowtham ▴ 20