Question

multifactorial design matrix

0

Entering edit mode

athief ▴ 20

@axelthieffry-11787

Last seen 3.4 years ago

Copenhagen

Dear all, despite reading on forums and others, I am still confused about how to properly set my design matrix for the following RNA-seq experiment. As a matter of fact, the more I read the less I know what strategy to go for, and whether my ideas are even correct.

There are:

3 different tissues: leaves, callus, cell suspension
2 different treatments: MJ and YE, each has its own mock: EtOH or water, respectively
2 timepoints: 12, 24h

What it looks like:

# full view
> as.data.frame(sample_info)
   tissues mock treat  t
1       lf  H2O    YE 12
2       lf  H2O    YE 24
3       lf  H2O  ctrl 12
4       lf  H2O  ctrl 24
5      cal  H2O    YE 12
6      cal  H2O    YE 24
7      cal  H2O  ctrl 12
8      cal  H2O  ctrl 24
9      sus  H2O    YE 12
10     sus  H2O    YE 24
11     sus  H2O  ctrl 12
12     sus  H2O  ctrl 24
13      lf EtOH    MJ 12
14      lf EtOH    MJ 24
15      lf EtOH  ctrl 12
16      lf EtOH  ctrl 24
17     cal EtOH    MJ 12
18     cal EtOH    MJ 24
19     cal EtOH  ctrl 12
20     cal EtOH  ctrl 24
21     sus EtOH    MJ 12
22     sus EtOH    MJ 24
23     sus EtOH  ctrl 12
24     sus EtOH  ctrl 24

# setting-up reference levels
> sample_info %<>% mutate('tissues'=factor(tissues, levels=c('lf', 'cal', 'sus')),
                          'mock'=factor(mock, levels=c('H2O', 'EtOH')),
                          'treat'=factor(treat, levels=c('ctrl', 'MJ', 'YE')),
                          't'=factor(as.character(t), levels=c('12', '24')))

So far I imagined setting up the model as:

# design matrix
> design_matrix <- model.matrix(~ 0 + tissues + mock + treat + t, data=sample_info)

# cleaning up
> rownames(design_matrix) <- sample_info %>% unite('sample', everything(), sep='_') %>% pull(sample)
> colnames(design_matrix) %<>% str_remove('tissues')

# full view
> design_matrix
                 lf cal sus mockEtOH treatMJ treatYE t24
lf_H2O_YE_12      1   0   0        0       0       1   0
lf_H2O_YE_24      1   0   0        0       0       1   1
lf_H2O_ctrl_12    1   0   0        0       0       0   0
lf_H2O_ctrl_24    1   0   0        0       0       0   1
cal_H2O_YE_12     0   1   0        0       0       1   0
cal_H2O_YE_24     0   1   0        0       0       1   1
cal_H2O_ctrl_12   0   1   0        0       0       0   0
cal_H2O_ctrl_24   0   1   0        0       0       0   1
sus_H2O_YE_12     0   0   1        0       0       1   0
sus_H2O_YE_24     0   0   1        0       0       1   1
sus_H2O_ctrl_12   0   0   1        0       0       0   0
sus_H2O_ctrl_24   0   0   1        0       0       0   1
lf_EtOH_MJ_12     1   0   0        1       1       0   0
lf_EtOH_MJ_24     1   0   0        1       1       0   1
lf_EtOH_ctrl_12   1   0   0        1       0       0   0
lf_EtOH_ctrl_24   1   0   0        1       0       0   1
cal_EtOH_MJ_12    0   1   0        1       1       0   0
cal_EtOH_MJ_24    0   1   0        1       1       0   1
cal_EtOH_ctrl_12  0   1   0        1       0       0   0
cal_EtOH_ctrl_24  0   1   0        1       0       0   1
sus_EtOH_MJ_12    0   0   1        1       1       0   0
sus_EtOH_MJ_24    0   0   1        1       1       0   1
sus_EtOH_ctrl_12  0   0   1        1       0       0   0
sus_EtOH_ctrl_24  0   0   1        1       0       0   1

In terms of overall differences, I naturally want to investigate:

overall difference between tissues
overall difference between mocks
overall difference between treatments

Is the current model correctly set or should treatments and mocks be grouped? To my understanding, YE is nested in H2O, and MJ nested in EtOH, but not the controls, so I am unsure what's best to do here.

With regards to interactions, I am mostly interested in the difference of response to treatments (MJ & YE) between tissues. Should I set up interactions directly in the model.matrix() call by adding + treatment:tissue (I believe), or via a contrast?

To make things more fun, I don't have any replicate because of the COVID situation. Nevertheless, I am thinking of considering 12 & 24h as replicates since I barely see any separation of timepoints in PCA. I know it is far from optimal, but how would the design evolve in that case?

Many thanks for your help in advance!

designmatrix RNA-seq limma • 1.7k views

ADD COMMENT • link updated 4.8 years ago by Gordon Smyth 53k • written 4.9 years ago by athief ▴ 20

score 1 · Answer 1 · 2021-04-03

Why not setup the design matrix in the standard way that we recommend for most experiments? If you ignore time as replicates, then you have an experiment with 12 treatment groups:

Group <- factor(paste(tissues,mock,treat,sep="."))
design <- model.matrix(~0+Group)
colnames(design) <- levels(Group)

You could also consider

design <- model.matrix(~0+Group+t)

which would adjust for baseline differences between 12h and 24h, if there are differences.

score 0 · Answer 2 · 2021-04-03

0

Entering edit mode

athief ▴ 20

@axelthieffry-11787

Last seen 3.4 years ago

Copenhagen

Anyone :) ?

ADD COMMENT • link 4.8 years ago athief ▴ 20