I am trying to determine the most appropriate design matrix to use for an RNA-seq experiment detailed below. Since my statistics knowledge isn't strong, I'd really appreciate some advice.
We have four patients (specifically, patient-derived cell lines A to D), and each patient was subjected to four different conditions: every combination of Oxygen (normoxia or hypoxia) and Treatment (no drug or drug). I combined these conditions into a single factor (Group) with four levels to help with construction of the design matrix.
    > library(tidyverse)
    > targets <- data.frame("Patient" = c(rep("A", 4), rep("B", 4), rep("C", 4), rep("D", 4)),
    +                       "Oxygen" = c(rep("Normoxia", 2), rep("Hypoxia", 2)),
    +                       "Treatment" = c("NoDrug", "Drug")) %>%
    +   unite("Group", Oxygen:Treatment, sep = ".", remove = FALSE) %>%
    +   mutate(Oxygen = factor(Oxygen, levels = c("Normoxia", "Hypoxia")),
    +          Treatment = factor(Treatment, levels = c("NoDrug", "Drug")),
    +          Group = factor(Group, levels = c("Normoxia.NoDrug", "Normoxia.Drug",
    +                                           "Hypoxia.NoDrug", "Hypoxia.Drug"))) %>%
    +   select(Patient, Oxygen, Treatment, Group)
    > targets
    Patient  Oxygen    Treatment  Group
    A        Normoxia  NoDrug     Normoxia.NoDrug
    A        Normoxia  Drug       Normoxia.Drug
    A        Hypoxia   NoDrug     Hypoxia.NoDrug
    A        Hypoxia   Drug       Hypoxia.Drug
    B        Normoxia  NoDrug     Normoxia.NoDrug
    B        Normoxia  Drug       Normoxia.Drug
    B        Hypoxia   NoDrug     Hypoxia.NoDrug
    B        Hypoxia   Drug       Hypoxia.Drug
    C        Normoxia  NoDrug     Normoxia.NoDrug
    C        Normoxia  Drug       Normoxia.Drug
    C        Hypoxia   NoDrug     Hypoxia.NoDrug
    C        Hypoxia   Drug       Hypoxia.Drug
    D        Normoxia  NoDrug     Normoxia.NoDrug
    D        Normoxia  Drug       Normoxia.Drug
    D        Hypoxia   NoDrug     Hypoxia.NoDrug
    D        Hypoxia   Drug       Hypoxia.Drug
Key biological questions
We want to find differentially expressed genes between the following:
- Hypoxia.Drug vs Normoxia.NoDrug
- Hypoxia.Drug vs Normoxia.Drug
- Hypoxia.Drug vs Hypoxia.NoDrug
Additionally (but not as important):
- Normoxia.Drug vs Normoxia.NoDrug
- Hypoxia.NoDrug vs Normoxia.NoDrug
Possible design matrix?
I have tried my best to find a similar experimental design. I think it is closest to Section 3.4.2 (Blocking) of the edgeR user's guide, except that here each patient contributes a set of four samples rather than a pair. Is that correct?
Here is the PCA plot for reference. As you can see, the samples largely separate by Group, apart from those from Patient A. I am therefore thinking that an additive model with a Patient term and a Group term (the combination of Oxygen and Treatment) will be most appropriate, as per below?
    Group <- targets$Group
    Patient <- targets$Patient
    # DESeq2 uses "formula" function instead of "model.matrix" below
    design <- model.matrix(~Patient + Group)
    colnames(design) <- gsub(x = colnames(design), pattern = "Group", replacement = "")
    colnames(design)
    # "(Intercept)" "PatientB" "PatientC" "PatientD" "Normoxia.Drug" "Hypoxia.NoDrug" "Hypoxia.Drug"
With an additive model, I believe the assumption is that each Oxygen plus Treatment combination has the same effect in every patient, which is what we see here? The coefficients from this model can then easily be used to form the contrasts of interest above.
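To make the last point concrete, here is a minimal sketch of how the five comparisons could be written as contrast vectors on this design. It uses base R only and rebuilds the factors so that it runs on its own. The contrast names (`HD_vs_NN` and so on) are labels I made up for illustration, and I am assuming the column order shown above.

```r
# Rebuild the two factors in base R (same sample order as `targets`)
Patient <- factor(rep(c("A", "B", "C", "D"), each = 4))
group_levels <- c("Normoxia.NoDrug", "Normoxia.Drug",
                  "Hypoxia.NoDrug", "Hypoxia.Drug")
Group <- factor(rep(group_levels, times = 4), levels = group_levels)

# Additive design: patient (blocking) effect plus group effect
design <- model.matrix(~ Patient + Group)
colnames(design) <- gsub("Group", "", colnames(design))

# Each contrast is a vector of weights over the seven design columns:
# (Intercept), PatientB, PatientC, PatientD,
# Normoxia.Drug, Hypoxia.NoDrug, Hypoxia.Drug.
# Normoxia.NoDrug is the reference level, so a comparison against it
# is just the single coefficient of the other group.
contrasts <- cbind(
  HD_vs_NN = c(0, 0, 0, 0,  0,  0, 1),  # Hypoxia.Drug   vs Normoxia.NoDrug
  HD_vs_ND = c(0, 0, 0, 0, -1,  0, 1),  # Hypoxia.Drug   vs Normoxia.Drug
  HD_vs_HN = c(0, 0, 0, 0,  0, -1, 1),  # Hypoxia.Drug   vs Hypoxia.NoDrug
  ND_vs_NN = c(0, 0, 0, 0,  1,  0, 0),  # Normoxia.Drug  vs Normoxia.NoDrug
  HN_vs_NN = c(0, 0, 0, 0,  0,  1, 0)   # Hypoxia.NoDrug vs Normoxia.NoDrug
)
rownames(contrasts) <- colnames(design)
contrasts
```

If I understand the documentation correctly, a single column such as `contrasts[, "HD_vs_ND"]` could then be passed as the `contrast` argument of edgeR's `glmQLFTest()`, or the whole matrix to limma's `contrasts.fit()`, but I have not shown that step here.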
Please let me know if this design matrix is appropriate. Many thanks in advance for your help. I always underestimate how difficult it is to set up the design matrix!
Thank you so much for your response Gordon, I truly appreciate it! Also glad to hear my design matrix is correct in this context :)
I have added a suggestion about an alternative representation without the intercept that might help you.
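For readers following along, an intercept-free version of the same additive model might look like the sketch below. This is one possible reading of "without the intercept", not necessarily the exact code suggested in the answer. The helper `make_contrast()` and the contrast names are hypothetical and exist only for this illustration. With `~ 0 + Group + Patient`, every Group level gets its own column, so each comparison is simply one group column minus another.

```r
# Rebuild the two factors in base R (same sample order as `targets`)
Patient <- factor(rep(c("A", "B", "C", "D"), each = 4))
group_levels <- c("Normoxia.NoDrug", "Normoxia.Drug",
                  "Hypoxia.NoDrug", "Hypoxia.Drug")
Group <- factor(rep(group_levels, times = 4), levels = group_levels)

# No intercept: four Group columns, then three Patient columns
design0 <- model.matrix(~ 0 + Group + Patient)
colnames(design0) <- gsub("Group", "", colnames(design0))

# Hypothetical helper: weight +1 on one column and -1 on another
make_contrast <- function(plus, minus) {
  w <- setNames(numeric(ncol(design0)), colnames(design0))
  w[plus] <- 1
  w[minus] <- -1
  w
}

contrasts0 <- cbind(
  HD_vs_NN = make_contrast("Hypoxia.Drug",   "Normoxia.NoDrug"),
  HD_vs_ND = make_contrast("Hypoxia.Drug",   "Normoxia.Drug"),
  HD_vs_HN = make_contrast("Hypoxia.Drug",   "Hypoxia.NoDrug"),
  ND_vs_NN = make_contrast("Normoxia.Drug",  "Normoxia.NoDrug"),
  HN_vs_NN = make_contrast("Hypoxia.NoDrug", "Normoxia.NoDrug")
)
rownames(contrasts0) <- colnames(design0)
contrasts0
```

This parameterization describes the same fitted model as `~ Patient + Group`, so the tests should give the same results. The practical difference is that the contrasts read directly as "group minus group". Because the column names are now syntactically valid, I would expect limma's `makeContrasts(Hypoxia.Drug - Normoxia.NoDrug, levels = design0)` to produce the same weights, though I have not run that here.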