Hi all,
I'm using edgeR to analyse my RNA-Seq data, but I'm having trouble choosing the right way to define the model matrix for my experiment.
I have 4 factors: genotype (2 levels: geno1, geno2); tissue (2 levels: root, leaf); treatment (2 levels: treat, CTR); and time (4 time points: T0, T1, T2, T3). I want to find differentially expressed genes using a model that includes the two-way interactions of treatment with each of the other factors. In R terms, I want a formula like:

~ tissue + genotype + time + treatment + genotype:treatment + tissue:treatment + time:treatment
To create the design matrix I used the following code:
tissue <- factor(c(rep("root", 28), rep("leaf", 28)))
tissue <- relevel(tissue, ref = "root")
genotype <- factor(rep(c(rep("geno1", 14), rep("geno2", 14)), 2))
genotype <- relevel(genotype, ref = "geno1")
treatment <- factor(rep(c(rep(c("CTR", "CTR", "treat", "CTR", "treat", "CTR", "treat"), 4)), each = 2))
treatment <- relevel(treatment, ref = "CTR")
time <- factor(rep(c(rep(c("T0", "T1", "T1", "T2", "T2", "T3", "T3"), 4)), each = 2))
time <- relevel(time, ref = "T0")
disegno <- cbind(tissue, genotype, time, treatment)
rownames(disegno) <- colnames(data)
disegno
design <- model.matrix(~ tissue + genotype + time + treatment + genotype:treatment + tissue:treatment + time:treatment)
So the design of the experiment looks like:
> disegno
                    tissue genotype time treatment
geno1.CTR0_root_a        1        1    1         1
geno1.CTR0_root_b        1        1    1         1
geno1.CTR1_root_a        1        1    2         1
geno1.CTR1_root_b        1        1    2         1
geno1.treat1_root_a      1        1    2         2
geno1.treat1_root_b      1        1    2         2
geno1.CTR2_root_a        1        1    3         1
geno1.CTR2_root_b        1        1    3         1
geno1.treat2_root_a      1        1    3         2
geno1.treat2_root_b      1        1    3         2
geno1.CTR3_root_a        1        1    4         1
geno1.CTR3_root_b        1        1    4         1
geno1.treat3_root_a      1        1    4         2
geno1.treat3_root_b      1        1    4         2
geno2.CTR0_root_a        1        2    1         1
geno2.CTR0_root_b        1        2    1         1
geno2.CTR1_root_a        1        2    2         1
geno2.CTR1_root_b        1        2    2         1
geno2.treat1_root_a      1        2    2         2
geno2.treat1_root_b      1        2    2         2
geno2.CTR2_root_a        1        2    3         1
geno2.CTR2_root_b        1        2    3         1
geno2.treat2_root_a      1        2    3         2
geno2.treat2_root_b      1        2    3         2
geno2.CTR3_root_a        1        2    4         1
geno2.CTR3_root_b        1        2    4         1
geno2.treat3_root_a      1        2    4         2
geno2.treat3_root_b      1        2    4         2
geno1.CTR0_leaf_a        2        1    1         1
geno1.CTR0_leaf_b        2        1    1         1
geno1.CTR1_leaf_a        2        1    2         1
geno1.CTR1_leaf_b        2        1    2         1
geno1.treat1_leaf_a      2        1    2         2
geno1.treat1_leaf_b      2        1    2         2
geno1.CTR2_leaf_a        2        1    3         1
geno1.CTR2_leaf_b        2        1    3         1
geno1.treat2_leaf_a      2        1    3         2
geno1.treat2_leaf_b      2        1    3         2
geno1.CTR3_leaf_a        2        1    4         1
geno1.CTR3_leaf_b        2        1    4         1
geno1.treat3_leaf_a      2        1    4         2
geno1.treat3_leaf_b      2        1    4         2
geno2.CTR0_leaf_a        2        2    1         1
geno2.CTR0_leaf_b        2        2    1         1
geno2.CTR1_leaf_a        2        2    2         1
geno2.CTR1_leaf_b        2        2    2         1
geno2.treat1_leaf_a      2        2    2         2
geno2.treat1_leaf_b      2        2    2         2
geno2.CTR2_leaf_a        2        2    3         1
geno2.CTR2_leaf_b        2        2    3         1
geno2.treat2_leaf_a      2        2    3         2
geno2.treat2_leaf_b      2        2    3         2
geno2.CTR3_leaf_a        2        2    4         1
geno2.CTR3_leaf_b        2        2    4         1
geno2.treat3_leaf_a      2        2    4         2
geno2.treat3_leaf_b      2        2    4         2
and the colnames of the design matrix of the model are:
> colnames(design)
 [1] "(Intercept)"                  "tissueleaf"                   "genotypegeno2"
 [4] "timeT1"                       "timeT2"                       "timeT3"
 [7] "treatmenttreat"               "genotypegeno2:treatmenttreat" "tissueleaf:treatmenttreat"
[10] "timeT1:treatmenttreat"        "timeT2:treatmenttreat"        "timeT3:treatmenttreat"
When I try to estimate the common dispersion with this model I get an error: the coefficient "timeT3:treatmenttreat" cannot be estimated. I think the problem comes from the fact that time point T0 is the starting point for both treated and untreated samples, so the design matrix is not of full rank. In other words, I don't have any observations for the combination T0-treated, so I cannot estimate the corresponding coefficient.
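The rank deficiency can be confirmed in plain R before ever calling edgeR. A minimal sketch, reduced to the time and treatment part of the layout above (the tissue and genotype factors don't affect the problem):

```r
# Reduced layout: no treated samples at T0, as in the experiment above.
time      <- factor(rep(c("T0", "T1", "T1", "T2", "T2", "T3", "T3"), each = 2))
treatment <- factor(rep(c("CTR", "CTR", "treat", "CTR", "treat", "CTR", "treat"), each = 2))

design <- model.matrix(~ time + treatment + time:treatment)
ncol(design)       # 8 columns ...
qr(design)$rank    # ... but rank 7, so one coefficient is not estimable

# The dependency: treated samples only occur at T1-T3, so the treatment
# column equals the sum of the three interaction columns.
all(design[, "treatmenttreat"] ==
    rowSums(design[, c("timeT1:treatmenttreat",
                       "timeT2:treatmenttreat",
                       "timeT3:treatmenttreat")]))   # TRUE
```

Dropping any one of those four linearly dependent columns restores full rank; which one you drop determines how the remaining coefficients are interpreted.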
Do you know how I can solve the problem? Do I have to remove the T0 observations and not consider them at all? I know an alternative would be to create a Group variable with all the combinations of the factors and then define the contrasts of interest, but I would like to learn how to handle this type of situation with the classical formulas.
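For what it's worth, the group-factor alternative mentioned above sidesteps rank problems entirely, because it only ever creates one column per observed combination. A minimal sketch (base R only, again reduced to time and treatment; with edgeR you would pass this design to glmFit and build the contrasts of interest, e.g. with limma's makeContrasts):

```r
time      <- factor(rep(c("T0", "T1", "T1", "T2", "T2", "T3", "T3"), each = 2))
treatment <- factor(rep(c("CTR", "CTR", "treat", "CTR", "treat", "CTR", "treat"), each = 2))

# One combined factor; only observed combinations become levels.
group  <- factor(paste(time, treatment, sep = "."))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

colnames(design)                  # the empty T0.treat cell never appears
qr(design)$rank == ncol(design)   # TRUE: the matrix is always of full rank
```

The treatment effect at T2, say, would then be the contrast T2.treat - T2.CTR.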
Any help will be greatly appreciated.
Marco
Hi Aaron,
thank you very much for your response, but there's something I don't understand. If I drop timeT1:treatmenttreat from the design matrix (i.e. I simply run the command

design <- design[, -10]

and then estimate this new model with the glmFit function), why should this change the interpretation of the treatmenttreat parameter? When you say "In the resulting matrix, treatmenttreat represents the effect of the treatment at T1, rather than at T0", are you referring to the linear term treatmenttreat, or am I misunderstanding you?

In your original design, the treatmenttreat term represents the effect of the treatment at time T0. timeT1:treatmenttreat represents the additional effect of the treatment at time T1. The total effect of the treatment for T1 samples is defined as treatmenttreat + timeT1:treatmenttreat.

If you remove timeT1:treatmenttreat, then the effect of the treatment for T1 samples is just defined as treatmenttreat, i.e., treatmenttreat represents the effect of the treatment at T1. The original interpretation of treatmenttreat is not feasible, given that there are no samples at T0 with which to estimate the effect of the treatment at that time.

The other time points are still treated normally. For example, for T2, the effect of the treatment is defined as treatmenttreat + timeT2:treatmenttreat. So, the latter term just represents the additional effect of the treatment at T2, relative to that at T1.

Now I get it! Perfect explanation. Thank you very much for your time.
Marco
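For later readers: the interpretation described above can be checked numerically. A small sketch with made-up cell means, base R's lm standing in for edgeR's GLM fit, and the layout again reduced to time and treatment:

```r
time      <- factor(rep(c("T0", "T1", "T1", "T2", "T2", "T3", "T3"), each = 2))
treatment <- factor(rep(c("CTR", "CTR", "treat", "CTR", "treat", "CTR", "treat"), each = 2))

design <- model.matrix(~ time + treatment + time:treatment)
design <- design[, colnames(design) != "timeT1:treatmenttreat"]  # drop the aliased column

# Made-up control means per time point and treatment effects at T1-T3:
ctrl <- c(T0 = 10, T1 = 11, T2 = 12, T3 = 13)
eff  <- c(T1 = 2, T2 = 5, T3 = -1)
y <- ctrl[as.character(time)] +
     ifelse(treatment == "treat", eff[as.character(time)], 0)

fit <- coef(lm(y ~ 0 + design))
fit[["designtreatmenttreat"]]          # 2: the treatment effect at T1
fit[["designtimeT2:treatmenttreat"]]   # 3: effect at T2 (5) minus effect at T1 (2)
fit[["designtimeT3:treatmenttreat"]]   # -3: effect at T3 (-1) minus effect at T1 (2)
```

So after the column is dropped, treatmenttreat is the T1 effect, and the remaining interactions are effects relative to T1, exactly as explained above.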