Dear all, despite reading on forums and others, I am still confused about how to properly set my design matrix for the following RNA-seq experiment. As a matter of fact, the more I read the less I know what strategy to go for, and whether my ideas are even correct.
There are:
- 3 different tissues: leaves, callus, cell suspension
- 2 different treatments: MJ and YE, each has its own mock: EtOH or water, respectively
- 2 timepoints: 12, 24h
What it looks like:
# full view
> as.data.frame(sample_info)
tissues mock treat t
1 lf H2O YE 12
2 lf H2O YE 24
3 lf H2O ctrl 12
4 lf H2O ctrl 24
5 cal H2O YE 12
6 cal H2O YE 24
7 cal H2O ctrl 12
8 cal H2O ctrl 24
9 sus H2O YE 12
10 sus H2O YE 24
11 sus H2O ctrl 12
12 sus H2O ctrl 24
13 lf EtOH MJ 12
14 lf EtOH MJ 24
15 lf EtOH ctrl 12
16 lf EtOH ctrl 24
17 cal EtOH MJ 12
18 cal EtOH MJ 24
19 cal EtOH ctrl 12
20 cal EtOH ctrl 24
21 sus EtOH MJ 12
22 sus EtOH MJ 24
23 sus EtOH ctrl 12
24 sus EtOH ctrl 24
# setting-up reference levels
> sample_info %<>% mutate('tissues'=factor(tissues, levels=c('lf', 'cal', 'sus')),
'mock'=factor(mock, levels=c('H2O', 'EtOH')),
'treat'=factor(treat, levels=c('ctrl', 'MJ', 'YE')),
't'=factor(as.character(t), levels=c('12', '24')))
So far I imagined setting up the model as:
# design matrix
> design_matrix <- model.matrix(~ 0 + tissues + mock + treat + t, data=sample_info)
# cleaning up
> rownames(design_matrix) <- sample_info %>% unite('sample', everything(), sep='_') %>% pull(sample)
> colnames(design_matrix) %<>% str_remove('tissues')
# full view
> design_matrix
lf cal sus mockEtOH treatMJ treatYE t24
lf_H2O_YE_12 1 0 0 0 0 1 0
lf_H2O_YE_24 1 0 0 0 0 1 1
lf_H2O_ctrl_12 1 0 0 0 0 0 0
lf_H2O_ctrl_24 1 0 0 0 0 0 1
cal_H2O_YE_12 0 1 0 0 0 1 0
cal_H2O_YE_24 0 1 0 0 0 1 1
cal_H2O_ctrl_12 0 1 0 0 0 0 0
cal_H2O_ctrl_24 0 1 0 0 0 0 1
sus_H2O_YE_12 0 0 1 0 0 1 0
sus_H2O_YE_24 0 0 1 0 0 1 1
sus_H2O_ctrl_12 0 0 1 0 0 0 0
sus_H2O_ctrl_24 0 0 1 0 0 0 1
lf_EtOH_MJ_12 1 0 0 1 1 0 0
lf_EtOH_MJ_24 1 0 0 1 1 0 1
lf_EtOH_ctrl_12 1 0 0 1 0 0 0
lf_EtOH_ctrl_24 1 0 0 1 0 0 1
cal_EtOH_MJ_12 0 1 0 1 1 0 0
cal_EtOH_MJ_24 0 1 0 1 1 0 1
cal_EtOH_ctrl_12 0 1 0 1 0 0 0
cal_EtOH_ctrl_24 0 1 0 1 0 0 1
sus_EtOH_MJ_12 0 0 1 1 1 0 0
sus_EtOH_MJ_24 0 0 1 1 1 0 1
sus_EtOH_ctrl_12 0 0 1 1 0 0 0
sus_EtOH_ctrl_24 0 0 1 1 0 0 1
In terms of overall differences, I naturally want to investigate:
- overall difference between tissues
- overall difference between mocks
- overall difference between treatments
Is the current model correctly set or should treatments and mocks be grouped? To my understanding, YE is nested in H2O, and MJ nested in EtOH, but not the controls, so I am unsure what's best to do here.
With regards to interactions, I am mostly interested in the difference of response to treatments (MJ & YE) between tissues. Should I set up interactions directly in the model.matrix()
call by adding + treatment:tissue
(I believe), or via a contrast?
To make things more fun, I don't have any replicate because of the COVID situation. Nevertheless, I am thinking of considering 12 & 24h as replicates since I barely see any separation of timepoints in PCA. I know it is far from optimal, but how would the design evolve in that case?
Many thanks for your help in advance!