I am analyzing RNA-seq reads to look at transcriptomic response to environmental stress, and I want to ensure that I am using the correct design formula for my experimental design. I have a condition factor with a control level and four other levels in no particular order (CL, A, B, C, D), biological sampling was done in triplicate, and there is a timepoint factor with four ordered levels (T1, T2, T3, T4).
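To make the layout concrete, here is a minimal base-R sketch of a sample table matching the description above. The object name `coldata` and the column names `condition`, `timepoint` and `replicate` are placeholders chosen for illustration, not names taken from the real data set.

```r
# Hypothetical sample table for the design described above:
# 5 conditions x 4 timepoints x 3 biological replicates = 60 samples.
coldata <- expand.grid(
  replicate = 1:3,
  timepoint = c("T1", "T2", "T3", "T4"),
  condition = c("CL", "A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Set factor levels explicitly so that CL is the reference (first) level of
# condition and the timepoints keep their T1..T4 order as a plain factor.
coldata$condition <- factor(coldata$condition,
                            levels = c("CL", "A", "B", "C", "D"))
coldata$timepoint <- factor(coldata$timepoint,
                            levels = c("T1", "T2", "T3", "T4"))

# Number of samples in each condition/timepoint cell; a balanced design
# has three replicates in every cell.
cell_counts <- table(coldata$condition, coldata$timepoint)
print(cell_counts)
```

Whether timepoint should be treated as an ordered factor or a plain one is left open here; the sketch uses a plain factor only to keep the example simple.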
My experimental questions
These are the types of questions I am interested in answering:
- Accounting for timepoint, is each non-control condition different from the control level?
- Within each condition, including the control level, does expression differ between timepoints?
- All else being equal, which genes were differentially expressed with respect to the reference condition?
- Accounting for timepoint, are arbitrary pairs of conditions (e.g. C vs A) different from each other?
What I've tried
I've read the DESeq2 paper and the vignette, and rummaged through various posts on the BioStars and Bioconductor forums. I've learned a lot from them about the DESeq2 package and the mathematics it performs, but it is still unclear to me how to construct a design formula that answers my questions. I've also followed a tutorial on design formulae in R in general, but it did not clarify the ordering of terms.
What are the rules for ordering the terms in a design formula for DESeq2? (What does ~ A + B mean, as opposed to ~ B + A?) I'd like a description of 'the general case' rather than special cases. I'm not a programming or math phobe, so lay it on me.
With the contrast argument of the results function, how do I similarly make comparisons that are conditioned on other factors?
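For reference, the two orderings can be compared in base R with `model.matrix`, assuming a hypothetical sample table with factors named `condition` and `timepoint` (both names are placeholders). This only shows which coefficient columns base R builds for each formula; it says nothing about how DESeq2 itself treats the order of terms, which is what the question is about.

```r
# Hypothetical sample table: one row per condition/timepoint combination.
coldata <- expand.grid(
  timepoint = c("T1", "T2", "T3", "T4"),
  condition = c("CL", "A", "B", "C", "D"),
  stringsAsFactors = FALSE
)
coldata$condition <- factor(coldata$condition,
                            levels = c("CL", "A", "B", "C", "D"))
coldata$timepoint <- factor(coldata$timepoint,
                            levels = c("T1", "T2", "T3", "T4"))

# Design matrices for the two term orderings (default treatment coding,
# so the first level of each factor is absorbed into the intercept).
mm_ct <- model.matrix(~ condition + timepoint, data = coldata)
mm_tc <- model.matrix(~ timepoint + condition, data = coldata)

# Column names in each ordering.
print(colnames(mm_ct))
print(colnames(mm_tc))

# TRUE if both formulas produce the same set of coefficients,
# differing only in column order.
print(setequal(colnames(mm_ct), colnames(mm_tc)))
```

For these additive formulas both matrices contain the same eight columns (an intercept, four condition coefficients relative to CL, and three timepoint coefficients relative to T1), just in a different order.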