Hi all,
I am new to R scripting and bioinformatics in general. I'm using edgeR to run DE statistics on a project with 2 factors: diet (consisting of 4 diets: 12, 24, 30 ,39) and stress (consisting of 5 stressors: C, L, H, B, LB). I have 4 replicates per treatment (80 samples in total). I would like to figure out which genes are DE across the main effects of diet, stress, and across a diet*stress interaction. I'm not sure how to run a full model at once that would give me the genes for the main effects and interactions (as is done in an ANOVA), so I've been looking at each separately by creating model matrices that just specify diet, just stress, an additive diet+stress model, and then a full model with all combinations. I noticed when I run a glmFit on diet, stress, diet+stress, and the interaction the number of DE genes changes depending on what model matrix I reference (see below). There are few differences between the diet and stress separate models and the diet+stress model, but large difference between those and the full model.
2 questions: Does anyone know why the number of DE genes changes depending on the design matrix?
Can anyone tell me the appropriate way to determine the main diet and stress effects and the interaction in one model (I know how to do individual contrasts but I'm looking for main and interaction effects first)?
Thanks in advance!
carrie
ADDITIVE MODEL:
>design.add<-model.matrix(~diet+stress)
> glm.stressadd<-glmFit(Hz_dgelist_filtered, design.add)
> design.add
> lrt.stressadd<-glmLRT(glm.stressadd, coef=5:8)
> topTags(lrt.stressadd)
> FDR.stressadd<-p.adjust(lrt.stressadd$table$PValue, method="BH")
> sum(FDR.stressadd<0.05)
[1] 5053
> glm.dietadd<-glmFit(Hz_dgelist_filtered, design.add)
> lrt.dietadd<-glmLRT(glm.dietadd, coef=2:4)
> topTags(lrt.dietadd)
> FDR.dietadd<-p.adjust(lrt.dietadd$table$PValue, method="BH")
> sum(FDR.dietadd<0.05)
[1] 2913
DIET MODEL:
> glm.diet<-glmFit(Hz_dgelist_filtered, design.diet)
> lrt.diet<-glmLRT(glm.diet, coef=2:4)
> topTags(lrt.diet)
> FDR.diet<-p.adjust(lrt.diet$table$PValue, method="BH")
> sum(FDR.diet<0.05)
[1] 3225
STRESS MODEL:
>design.stress<-model.matrix(~stress)
> glm.stress<-glmFit(Hz_dgelist_filtered, design.stress)
> lrt.stress<-glmLRT(glm.stress, coef=2:5)
> topTags(lrt.stress)
> FDR.stress<-p.adjust(lrt.stress$table$PValue, method="BH")
> sum(FDR.stress<0.05)
[1] 5371
DIET/FULL MODEL:
> design.full<-model.matrix(~0+diet+stress+diet:stress)
> glm.dietfull<-glmFit(Hz_dgelist_filtered, design.full)
> lrt.dietfull<-glmLRT(glm.dietfull, coef=2:4)
> topTags(lrt.dietfull)
> FDR.dietfull<-p.adjust(lrt.dietfull$table$PValue, method="BH")
> sum(FDR.dietfull<0.05)
[1] 10719
STRESS/FULL MODEL:
> glm.stressfull<-glmFit(Hz_dgelist_filtered, design.full)
> lrt.stressfull<-glmLRT(glm.stressfull, coef=5:8)
> topTags(lrt.stressfull)
> FDR.stressfull<-p.adjust(lrt.stressfull$table$PValue, method="BH")
> sum(FDR.stressfull<0.05)
[1] 1043
> glm.intfull<-glmFit(Hz_dgelist_filtered, design.full)
> lrt.intfull<-glmLRT(glm.intfull, coef=9:20)
> topTags(lrt.intfull)
> FDR.intfull<-p.adjust(lrt.intfull$table$PValue, method="BH")
> sum(FDR.intfull<0.05)
[1] 587
Thanks for your explanation Aaron. I'm afraid I'm still a little confused. Please bear with me. When I say a "main effect" of diet, what I essentially want to know is which genes are DEed across all diet treatments, regardless of stress (main effect of stress would be DE across stress treatments regardless of diet). I tend to think of this like an ANOVA, when you want to see whether diet or stress have main effects and you can get a p-value for each, as well as a diet*stress interaction. The way I see it, which may be incorrect, is that you could essentially run an ANOVA (or more accurately a GLM) for each gene, with the dispersions being the dependent variable and ask whether expression is significantly different across diets, stressor, or diet-stress interactions. This would undoubtedly be tedious for tens of thousand of genes, but isn't that essentially what the edgeR glmfit/lrt function does? Please let me know if I'm completely off-base with this. However, if this is true wouldn't the following (which is from my initial question) give me the number of genes DE expressed across all diets, stressor, and interactions:
DIET/FULL MODEL:
> design.full<-model.matrix(~0+diet+stress+diet:stress)
> glm.dietfull<-glmFit(Hz_dgelist_filtered, design.full)
> lrt.dietfull<-glmLRT(glm.dietfull, coef=2:4)
> topTags(lrt.dietfull)
> FDR.dietfull<-p.adjust(lrt.dietfull$table$PValue, method="BH")
> sum(FDR.dietfull<0.05)
[1] 10719
STRESS/FULL MODEL:
> glm.stressfull<-glmFit(Hz_dgelist_filtered, design.full)
> lrt.stressfull<-glmLRT(glm.stressfull, coef=5:8)
> topTags(lrt.stressfull)
> FDR.stressfull<-p.adjust(lrt.stressfull$table$PValue, method="BH")
> sum(FDR.stressfull<0.05)
[1] 1043
INTERACTION/FULL MODEL:
> glm.intfull<-glmFit(Hz_dgelist_filtered, design.full)
> lrt.intfull<-glmLRT(glm.intfull, coef=9:20)
> topTags(lrt.intfull)
> FDR.intfull<-p.adjust(lrt.intfull$table$PValue, method="BH")
> sum(FDR.intfull<0.05)
[1] 587
After getting this general # of genes I will definitely want to make all the comparisons to narrow things down, so thanks for the contrasts code. One other question though, when I use:
> con<-makeContrasts(12.C - 24.C, levels=design.oneway)
I get this error: Error: unexpected symbol in "con<-makeContrasts(12.C"
Any ideas?
thanks,
carrie
You're quite off-base. For starters, while a true ANOVA compares variances, this is not analogous to NB GLMs and NB dispersions. Instead, the LRT looks at the likelihood ratio, i.e., the difference between the total residual deviances of the null and full models. The same dispersion is used between the two models, so it's not really a dependent variable that changes according to your contrast. Anyway, regardless of whether you think of it in terms of an ANOVA or a LRT, the fact remains that you cannot interpret the p-values of your main effects if you have a significant interaction term. It's easy to get those p-values from various software (i.e.,
edgeR
with a different model, or by using the basicanova
function), but that doesn't ensure that the computed values have any meaning. See my original answer for some examples.As for your contrasts, I can't be completely sure of what you've done, because I don't know what
diet
andstress
are. However, I'm guessing you're just dropping the non-interaction terms associated with diet or stress (at least in the first two contrasts). This will not give you the main effect of diet or stress; the interaction terms will absorb DE in the null model, such that you will only be testing for DE in one stressor (when you drop the diet terms) or in one diet (when you drop the stress terms). This is a confusing issue at the best of times; check out this post (Different results in edgeR using simple vs GLM) and the associated answers for a discussion of this. In short, just because the coefficient is labeled bymodel.matrix
as the effect of a particular diet/stressor doesn't mean that it actually represents that effect in the model.Finally, the contrast code can't handle names that start with numbers; this has been corrected by slapping a
d
on the start.Hi Aaron,
I am trying to do a similar analysis and I had a question about your last example of your response, in which you compare the expression between two diets, averaging over all stressors for each diet. Why do you divide by 2 here, and not 5, given that you are averaging the means of 5 groups?
Thanks,
Allie
Clearly, it's because I can't count.
Well spotted, I probably started off with two groups as an example, expanded the code to use all five groups, and forgot to change the denominator. This has now been fixed. Note that this doesn't affect the p-values, only the log-fold changes.