I have a dataset composed of 2 ecotypes, 2 conditions (control, treatment) and 2 geographic units (24 samples in total). I want to look at which genes are differentially expressed:
1. - between ecotype A and B at control,
2. - between ecotype A and B at treatment
3. - between control and treatment for ecotype A,
4. - between control and treatment for ecotype B or
5- equally I would like to look at different combinations including the geographic unit, between north and south for ecotype A, etc.
and I would even like to raise one level of complexity and include the three factors, for example, difference between north and south at treatment conditions for ecotype A (etc).
Following the DESeq2 manual, section 3.3 (Interactions), my solution is to define a factor as a combination of the factors I want to measure for example:
condition=rep(rep(c("A.ctl","A.treat","B.ctl","B.treat"),each=3),2)
and then use just condition in the design formula:
cond <- DESeqDataSetFromMatrix(countData = countData,colData =colData, design = ~ condition)
and apply the contrast formula to look at two comparisons like so:
condP <- DESeq(cond)
resMTC <- results(condP, contrast = c("condition", "A.treat", "A.ctl"))
but someone told me to do this instead:
resMTC <- results(condP, contrast=c(1, -1, -1, 1))
Is this correct? If so why?
I would also like to know if in this case the multifactor analysis approach is more suitable. If I understand correctly, with the multifactor analysis including for example ecotype+condition, I will get a list of genes that are differentially expressed where the effect of the 2 ecotypes is controlled by the effect of the treatment. So I can extract the up and down regulated genes for the ecotype A and B, but I won't be able to know which genes are up regulated in A and at treatment conditions, whereas with the pairwise approach I would. Is this correct?
I was also suggested that if I do the multifactor analysis I have to use ecotype+condition+ecotype:condition for the full model and then ecotype:condition for the reduced but again wouldn't this analysis only yield the list of DEG between ecotypes A and B with no information about how the treatment/control conditions affect each gene? Otherwise how would I extract that information from the multifactor list.
These are my questions:
- Are ecotypes A and B responding different to the treatment (or to the control) conditions? (Q 1 and 2)
- What genes are DE between control and treatment for each ecotype? (Q 3 and 4)
- Does ecotype A in location 1 respond different to the treatment than the same ecotype in location 2? (Q 5)
Thanks
Sorry, then you are not gathering the correct view.
My response above, and the part of the vignette above and below the part you quoted explain when an interaction term is useful.
From the vignette:
"Interaction terms can be added to the design formula, in order to test, for example, if the log2 fold change attributable to a given condition is different based on another factor, for example if the condition effect differs across genotype. Many users begin to add interaction terms..."
And above:
"Some of your questions at the bottom are for testing differences in the treatment effect across ecotype, in other words, testing an interaction..."
Beyond this, you need to consult a local statistician or a person with a quantitative background. There is nothing special about DESeq2 here. You just need someone to explain to you when interaction terms are used in linear models, and to help you formulate your questions quantitatively in a linear model.
Hi , Thanks for your guiding, I consulted a statistician and was advised to use a multifactor analysis with all three factors and all possible interactions. He suggested to then remove in a step-wise fashion the insignificant coefficients. He has no experience with RNA-Seq data so now that I came to check the pvalues of the coefficients after running the multifactor test I realise each gene has a different estimate for the coefficient... I guess it is then better to try removing an interaction term from the full model one at a time and compare the fit of the full vs the reduced model, sequencially keeping always the one that performs better. My question is how can you do this is DESEq2?
Thanks
Great. What aov() does in base R (and so what you are seeing when you use summary.aov) is a series of model comparisons, where terms are sequentially removed. You can do this with DESeq2 by repeatedly running DESeq() with test="LRT" and removing terms sequentially from full and reduced.
I was editting my question as you responded. sorry because now it looks messy. Thanks again
Sorry, what I was asking before is how can I use the step function with DESeq2 results, meaning that I need to include a model name and when I use dds as an object in doesn't work.
You have to make the comparisons explicitly and manually with DESeq2. It's up to you to specify full and reduced (which can be formula or design matrices), which will give you results similar to a single row in a summary.aov table (for each gene).
I post below for ease of reading