Hi all,
I'm using edgeR to analyse my RNA-Seq data, but I'm having trouble choosing the right way to define the model matrix for my experiment.
I have 4 factors: genotype (2 levels: geno1, geno2); tissue (2 levels: root, leaf); treatment (2 levels: treat, CTR); and time (4 time points: T0, T1, T2, T3). I want to find differentially expressed genes using a model that includes the two-way interactions of treatment with each of the other factors. In R terms, I want a formula like:

~ tissue + genotype + time + treatment + genotype:treatment + tissue:treatment + time:treatment
To create the design matrix I used the following code:
tissue <- factor(c(rep("root", 28), rep("leaf", 28)))
tissue <- relevel(tissue, ref = "root")
genotype <- factor(rep(c(rep("geno1", 14), rep("geno2", 14)), 2))
genotype <- relevel(genotype, ref = "geno1")
treatment <- factor(rep(c(rep(c("CTR", "CTR", "treat", "CTR", "treat", "CTR", "treat"), 4)), each = 2))
treatment <- relevel(treatment, ref = "CTR")
time <- factor(rep(c(rep(c("T0", "T1", "T1", "T2", "T2", "T3", "T3"), 4)), each = 2))
time <- relevel(time, ref = "T0")
disegno <- cbind(tissue, genotype, time, treatment)
rownames(disegno) <- colnames(data)
disegno
design <- model.matrix(~ tissue + genotype + time + treatment + genotype:treatment + tissue:treatment + time:treatment)
So the design of the experiment looks like:
> disegno
                    tissue genotype time treatment
geno1.CTR0_root_a        1        1    1         1
geno1.CTR0_root_b        1        1    1         1
geno1.CTR1_root_a        1        1    2         1
geno1.CTR1_root_b        1        1    2         1
geno1.treat1_root_a      1        1    2         2
geno1.treat1_root_b      1        1    2         2
geno1.CTR2_root_a        1        1    3         1
geno1.CTR2_root_b        1        1    3         1
geno1.treat2_root_a      1        1    3         2
geno1.treat2_root_b      1        1    3         2
geno1.CTR3_root_a        1        1    4         1
geno1.CTR3_root_b        1        1    4         1
geno1.treat3_root_a      1        1    4         2
geno1.treat3_root_b      1        1    4         2
geno2.CTR0_root_a        1        2    1         1
geno2.CTR0_root_b        1        2    1         1
geno2.CTR1_root_a        1        2    2         1
geno2.CTR1_root_b        1        2    2         1
geno2.treat1_root_a      1        2    2         2
geno2.treat1_root_b      1        2    2         2
geno2.CTR2_root_a        1        2    3         1
geno2.CTR2_root_b        1        2    3         1
geno2.treat2_root_a      1        2    3         2
geno2.treat2_root_b      1        2    3         2
geno2.CTR3_root_a        1        2    4         1
geno2.CTR3_root_b        1        2    4         1
geno2.treat3_root_a      1        2    4         2
geno2.treat3_root_b      1        2    4         2
geno1.CTR0_leaf_a        2        1    1         1
geno1.CTR0_leaf_b        2        1    1         1
geno1.CTR1_leaf_a        2        1    2         1
geno1.CTR1_leaf_b        2        1    2         1
geno1.treat1_leaf_a      2        1    2         2
geno1.treat1_leaf_b      2        1    2         2
geno1.CTR2_leaf_a        2        1    3         1
geno1.CTR2_leaf_b        2        1    3         1
geno1.treat2_leaf_a      2        1    3         2
geno1.treat2_leaf_b      2        1    3         2
geno1.CTR3_leaf_a        2        1    4         1
geno1.CTR3_leaf_b        2        1    4         1
geno1.treat3_leaf_a      2        1    4         2
geno1.treat3_leaf_b      2        1    4         2
geno2.CTR0_leaf_a        2        2    1         1
geno2.CTR0_leaf_b        2        2    1         1
geno2.CTR1_leaf_a        2        2    2         1
geno2.CTR1_leaf_b        2        2    2         1
geno2.treat1_leaf_a      2        2    2         2
geno2.treat1_leaf_b      2        2    2         2
geno2.CTR2_leaf_a        2        2    3         1
geno2.CTR2_leaf_b        2        2    3         1
geno2.treat2_leaf_a      2        2    3         2
geno2.treat2_leaf_b      2        2    3         2
geno2.CTR3_leaf_a        2        2    4         1
geno2.CTR3_leaf_b        2        2    4         1
geno2.treat3_leaf_a      2        2    4         2
geno2.treat3_leaf_b      2        2    4         2
and the colnames of the design matrix of the model are:
> colnames(design)
 [1] "(Intercept)"                  "tissueleaf"                   "genotypegeno2"
 [4] "timeT1"                       "timeT2"                       "timeT3"
 [7] "treatmenttreat"               "genotypegeno2:treatmenttreat" "tissueleaf:treatmenttreat"
[10] "timeT1:treatmenttreat"        "timeT2:treatmenttreat"        "timeT3:treatmenttreat"
When I try to estimate the common dispersion with this model I get an error: the coefficient "timeT3:treatmenttreat" cannot be estimated. I think the problem comes from the fact that time point T0 is the starting point for both treated and untreated samples, so the design matrix is not of full rank. In other words, I don't have any observations for the combination T0-treated, so I cannot estimate the corresponding coefficient.
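The rank deficiency can be confirmed in plain R before ever calling edgeR. A minimal sketch, reduced to the time and treatment part of the layout above (the tissue and genotype factors don't affect the problem):

```r
# Reduced layout: no treated samples at T0, as in the experiment above.
time      <- factor(rep(c("T0", "T1", "T1", "T2", "T2", "T3", "T3"), each = 2))
treatment <- factor(rep(c("CTR", "CTR", "treat", "CTR", "treat", "CTR", "treat"), each = 2))

design <- model.matrix(~ time + treatment + time:treatment)
ncol(design)       # 8 columns ...
qr(design)$rank    # ... but rank 7, so one coefficient is not estimable

# The dependency: treated samples only occur at T1-T3, so the treatment
# column equals the sum of the three interaction columns.
all(design[, "treatmenttreat"] ==
    rowSums(design[, c("timeT1:treatmenttreat",
                       "timeT2:treatmenttreat",
                       "timeT3:treatmenttreat")]))   # TRUE
```

Dropping any one of those four linearly dependent columns restores full rank; which one you drop determines how the remaining coefficients are interpreted.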
Do you know how I can solve the problem? Do I have to remove the T0 observations and not consider them at all? I know an alternative would be to create a Group variable with all the combinations of the factors and then define the contrasts of interest, but I would like to learn how to handle this type of situation with the classical formulas.
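For what it's worth, the group-factor alternative mentioned above sidesteps rank problems entirely, because it only ever creates one column per observed combination. A minimal sketch (base R only, again reduced to time and treatment; with edgeR you would pass this design to glmFit and build the contrasts of interest, e.g. with limma's makeContrasts):

```r
time      <- factor(rep(c("T0", "T1", "T1", "T2", "T2", "T3", "T3"), each = 2))
treatment <- factor(rep(c("CTR", "CTR", "treat", "CTR", "treat", "CTR", "treat"), each = 2))

# One combined factor; only observed combinations become levels.
group  <- factor(paste(time, treatment, sep = "."))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

colnames(design)                  # the empty T0.treat cell never appears
qr(design)$rank == ncol(design)   # TRUE: the matrix is always of full rank
```

The treatment effect at T2, say, would then be the contrast T2.treat - T2.CTR.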
Any help will be greatly appreciated.
Marco
Hi Aaron,
thank you very much for your response, but there's something I don't understand. If I drop timeT1:treatmenttreat from the design matrix (i.e. I simply run the command

design <- design[, -10]

and then estimate this new model with the glmFit function), why should this change the interpretation of the treatmenttreat parameter? When you say "In the resulting matrix, treatmenttreat represents the effect of the treatment at T1, rather than at T0", are you referring to the linear term treatmenttreat, or am I misunderstanding you?

In your original design, the treatmenttreat term represents the effect of the treatment at time T0. timeT1:treatmenttreat represents the additional effect of the treatment at time T1. The total effect of the treatment for T1 samples is defined as treatmenttreat + timeT1:treatmenttreat.

If you remove timeT1:treatmenttreat, then the effect of the treatment for T1 samples is just defined as treatmenttreat, i.e., treatmenttreat represents the effect of the treatment at T1. The original interpretation of treatmenttreat is not feasible, given that there are no samples at T0 with which to estimate the effect of the treatment at that time.

The other time points are still treated normally. For example, for T2, the effect of the treatment is defined as treatmenttreat + timeT2:treatmenttreat. So, the latter term just represents the additional effect of the treatment at T2, relative to that at T1.

Now I get it! Perfect explanation. Thank you very much for your time.
Marco
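For later readers: the interpretation described above can be checked numerically. A small sketch with made-up cell means, base R's lm standing in for edgeR's GLM fit, and the layout again reduced to time and treatment:

```r
time      <- factor(rep(c("T0", "T1", "T1", "T2", "T2", "T3", "T3"), each = 2))
treatment <- factor(rep(c("CTR", "CTR", "treat", "CTR", "treat", "CTR", "treat"), each = 2))

design <- model.matrix(~ time + treatment + time:treatment)
design <- design[, colnames(design) != "timeT1:treatmenttreat"]  # drop the aliased column

# Made-up control means per time point and treatment effects at T1-T3:
ctrl <- c(T0 = 10, T1 = 11, T2 = 12, T3 = 13)
eff  <- c(T1 = 2, T2 = 5, T3 = -1)
y <- ctrl[as.character(time)] +
     ifelse(treatment == "treat", eff[as.character(time)], 0)

fit <- coef(lm(y ~ 0 + design))
fit[["designtreatmenttreat"]]          # 2: the treatment effect at T1
fit[["designtimeT2:treatmenttreat"]]   # 3: effect at T2 (5) minus effect at T1 (2)
fit[["designtimeT3:treatmenttreat"]]   # -3: effect at T3 (-1) minus effect at T1 (2)
```

So after the column is dropped, treatmenttreat is the T1 effect, and the remaining interactions are effects relative to T1, exactly as explained above.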