I'm using edgeR version 3.10.2 to analyze my RNAseq data. I have a design in which there are two factors (habitat and watershed) with two levels in each factor. I have 3 replicates for each of the four groups, to make 12 samples total. I am also interested in the interaction between the two factors. To test for differentially expressed genes between the habitat levels, the watershed levels, and the interaction between them, I am using a design matrix (shown below) modelled on section 3.3.4 of the edgeR users guide:
(Intercept) habStream shedRoberts habStream:shedRoberts
ML1 1 0 0 0
ML2 1 0 0 0
ML3 1 0 0 0
MS1 1 1 0 0
MS2 1 1 0 0
MS3 1 1 0 0
RL1 1 0 1 0
RL2 1 0 1 0
RL3 1 0 1 0
RS1 1 1 1 1
RS2 1 1 1 1
RS3 1 1 1 1
To first look at genes differentially expressed between the habitat levels, I use:
lrt.hab <- glmLRT(fit, coef=2)
because the second model coefficient is for the habitat term. This returns a list of the top de genes.
However, I was interested to see what would happen if I redid the analysis as a simple comparison between the two habitat levels, ignoring the other factor. To do this, I use the exactTest function. When I get the list of top de genes for this analysis, the first gene is the same as in the previous analysis, but all the others are different.
My question is, when using the glmLRT, is it working like a 2-way ANOVA in that it is keeping the other factor (watershed) constant while testing for de between habitats? Is the the reason I am getting two different lists? In that case, which method is more statistically sound?
Thanks for your help,
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
attached base packages:
 stats graphics grDevices utils datasets methods base
other attached packages:
 statmod_1.4.21 edgeR_3.10.2 limma_3.24.15
loaded via a namespace (and not attached):
 tools_3.2.0 splines_3.2.0
I am going to have to disagree with certain aspects of both my and Yunshun's answers.
First, I didn't read your question closely, and didn't understand that you were fitting an exactTest() on all your data, ignoring the watershed. In that case you have more degrees of freedom than the factorial model you fit, and the differences are probably mostly due to biases that arise because you are ignoring important information. So Yunshun makes a good point that you should look at MDS plots to see if there are big differences due to watershed.
Second, I disagree with Yunshun's contention that the second coefficient measures the difference between the two habitats within one watershed. While in some technical sense this is true (you are setting one watershed as the 'baseline', and making comparisons based on that), this is still a main effect and is intended to measure the average difference between habitats after controlling for the average difference between watersheds. In other words, you could relevel() your watershed factor and re-fit the model and the estimated values of your habitat coefficient wouldn't change.
HOWEVER, this is only true for those genes that do not have a significant interaction term, and in the case that you don't actually fit the interaction. If the interaction term is significant, then the second coefficient isn't interpretable (neither is the third coefficient). It neither represents the habitat effect within a given watershed, nor a main effect. It's uninterpretable.
On the other hand, for those genes where you can drop the interaction term, you should do so, in which case the second term should be interpreted as the main effect of habitat (where you have blocked on watershed to control for inter-watershed differences).
This is a tricky thing that I am not sure many people think about when doing genomics analysis. If you fit a factorial model with an interaction, you cannot then look at the top results for the main effects because if there is a significant interaction, then the main effects are not interpretable. You could hypothetically fit your model, find the genes with a significant interaction, then exclude those genes and fit a 2-way ANOVA without an interaction on the remaining genes.
Or you could do the arguably smarter thing, which is what Yunshun recommends, and fit a one-way layout. In that scenario, you can still fit an interaction using a contrast, and you can make comparisons within the two different watersheds (or within streams). Comparisons between the two watersheds (or between streams, averaging over watersheds) can still be a bit fraught, as the average of habitat1 vs habitat2 is still uninterpretable if there is a significant interaction.
I'm pretty sure that with the parametrization shown, the second coefficient does indeed measure the difference between habitats within the baseline watershed, not the main effect of habitat across all watersheds. If you used sum-to-zero contrasts for watershed, then I think the second coefficient would represent the average effect across all watersheds (which is still not quite the same thing as the main effect unless the design is perfectly balanced).
But in any case, I definitely agree that using the one-way design avoids all this confusion entirely and is probably the preferred way to do this analysis.
I agree with Ryan and Andy in the interpretation of the second coefficient, in that it represents the log-fold change between MS and ML only. This is because the log-fold change between RL and RS will be modelled as a linear combination of the second and fourth coefficients. The second coefficient could take on any value it wants, without affecting the ability of the model to fit to the observed RL-RS log-fold change (as the fourth coefficient picks up the slack, so to speak). This suggests that the value of the second coefficient does not depend on what RL-RS are doing, only on the behaviour of ML-MS. Thus, the second coefficient shouldn't be interpreted as a main habitat effect in the full model.
Either that, or we're all wrong, and we'll have enough egg on our faces to make a decent-sized omelette.
Edit: Andy is Yunshun, by the way. That's why you never see them both in the same room at the same time.
Ugh. You're right. I'll scrape the egg off of my face and cook the omelette. And now you know why I never fit 2-way ANOVA...
Good to know that there will be no egg on my face...