Question: edgeR effects of design on testing main effects and interactions
0
3.9 years ago by
United States

Hi all,

I am new to R scripting and bioinformatics in general. I'm using edgeR to run DE statistics on a project with 2 factors: diet (consisting of 4 diets: 12, 24, 30 ,39) and stress (consisting of 5 stressors: C, L, H, B, LB). I have 4 replicates per treatment (80 samples in total). I would like to figure out which genes are DE across the main effects of diet, stress, and across a diet*stress interaction. I'm not sure how to run a full model at once that would give me the genes for the main effects and interactions (as is done in an ANOVA), so I've been looking at each separately by creating model matrices that just specify diet, just stress, an additive diet+stress model, and then a full model with all combinations. I noticed when I run a glmFit on diet, stress, diet+stress, and the interaction the number of DE genes changes depending on what model matrix I reference (see below). There are few differences between the diet and stress separate models and the diet+stress model, but large difference between those and the full model.

2 questions: Does anyone know why the number of DE genes changes depending on the design matrix?
Can anyone tell me the appropriate way to determine the main diet and stress effects and the interaction                      in one model (I know how to do individual contrasts but I'm looking for main and interaction effects first)?

carrie

> FDR.stressadd<-p.adjust(lrt.stressadd$table$PValue, method="BH")
[1] 5053

> FDR.dietadd<-p.adjust(lrt.dietadd$table$PValue, method="BH")
[1] 2913

DIET MODEL:

> glm.diet<-glmFit(Hz_dgelist_filtered, design.diet)
> lrt.diet<-glmLRT(glm.diet, coef=2:4)
> topTags(lrt.diet)
> FDR.diet<-p.adjust(lrt.diet$table$PValue, method="BH")
> sum(FDR.diet<0.05)
[1] 3225

STRESS MODEL:

>design.stress<-model.matrix(~stress)
> glm.stress<-glmFit(Hz_dgelist_filtered, design.stress)
> lrt.stress<-glmLRT(glm.stress, coef=2:5)
> topTags(lrt.stress)
> FDR.stress<-p.adjust(lrt.stress$table$PValue, method="BH")
> sum(FDR.stress<0.05)
[1] 5371

DIET/FULL MODEL:

> design.full<-model.matrix(~0+diet+stress+diet:stress)
> glm.dietfull<-glmFit(Hz_dgelist_filtered, design.full)
> lrt.dietfull<-glmLRT(glm.dietfull, coef=2:4)
> topTags(lrt.dietfull)
> FDR.dietfull<-p.adjust(lrt.dietfull$table$PValue, method="BH")
> sum(FDR.dietfull<0.05)
[1] 10719

STRESS/FULL MODEL:

> glm.stressfull<-glmFit(Hz_dgelist_filtered, design.full)
> lrt.stressfull<-glmLRT(glm.stressfull, coef=5:8)
> topTags(lrt.stressfull)
> FDR.stressfull<-p.adjust(lrt.stressfull$table$PValue, method="BH")
> sum(FDR.stressfull<0.05)
[1] 1043

> glm.intfull<-glmFit(Hz_dgelist_filtered, design.full)
> lrt.intfull<-glmLRT(glm.intfull, coef=9:20)
> topTags(lrt.intfull)
> FDR.intfull<-p.adjust(lrt.intfull$table$PValue, method="BH")
> sum(FDR.intfull<0.05)
[1] 587

edger glmlrt() glmfit • 1.5k views
modified 3.9 years ago by Aaron Lun25k • written 3.9 years ago by cadeans0
Answer: edgeR effects of design on testing main effects and interactions
5
3.9 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

The DE genes will obviously change with the different design matrices. When you use the full model, you're accounting for interactions between the diet and stress conditions. This is not the case when you use the additive model, such that if any significant interactions exist in your data set, they will not be properly modelled. The typical consequence of this would be inflation of the dispersion estimates, because non-additivity of responses results in deviance from the fitted value; and distortion of the computed log-fold changes for the "main effects", which cannot be interpreted in the presence of significant interaction terms.

Now, if you want to identify the main effect of stress or diet, you must ensure that the individual interaction terms are not significant. For example, say you want to identify the main effect of diet 12 against diet 24. This would not make any sense for a particular gene if, for example, that gene was upregulated in 12 vs. 24 in stress C, but downregulated in 12 vs. 24 in stress L (i.e., there is an interaction effect between 12 vs. 24 and L vs. C). Sure, you can compute an average log-fold change for 12 vs. 24 over all stressors, but that will be totally misleading if the 12 vs. 24 log-fold change in each stressor is not in a consistent direction. Even if the direction is the same, a large average log-fold change may be driven by one stressor, with near-zero values for all other stressors - this will be misleading as it will suggest a large overall effect where there is none.

So, to continue with this example; if you want to identify the diet main effect for 12 vs. 24, you need to identify the four interaction terms for 12 vs. 24 (one for each other stressor relative to whatever model.matrix has defined as the baseline - probably B), find the genes for which all of those four terms are not significant, and then you can test for a main effect of 12 vs. 24 by refitting a model without those interaction terms. This is quite a chore. An easier way to test for DE between diets 12 and 24 is to perform tests individually within each stressor, i.e., test for 12 vs. 24 in stress L, repeat for stress C, etc. Then, you can identify genes that are significant in all stressors (or, less conservatively, in a minimum number of stressors), with the requirement that all significant log-fold changes have the same sign. This intersection will yield a set of genes that have DE in the same direction between diets 12 and 24 across all or most stressors.

There are also other strategies to test for DE between 12 and 24 without having to formally test for a main effect, e.g., performing an ANOVA for 12 vs. 24 across all stressors and picking those significant genes with consistent signs of the log-fold change for all stressors, or comparing the averages of diet 12-associated coefficients with their diet 24 counterparts (though you'll need to identify non-significant interactions for this to make sense). In all cases, it's a lot easier to do this with a one-way layout rather than a factorial design:

grouping <- factor(paste0("d", diet, ".", stress))
design.oneway <- model.matrix(~0+grouping)
colnames(design.oneway) <- levels(grouping)

Testing for differences between diets 12 and 24 within a stressor (e.g., C) can be achieved by comparing the corresponding groups:

con <- makeContrasts(d12.C - d24.C, levels=design.oneway)

You can do this separately for each stressor, and intersect across some or all stressors as I've described in my first suggestion. Alternatively, you can do an ANOVA-like test across all stressors by combining the contrasts before testing:

con <- makeContrasts(d12.C - d24.C, d12.L - d24.L, d12.B - d24.B,
d12.H - d24.H, d12.LB - d24.LB, levels=design.oneway)

Or you can test for the average:

con <- makeContrasts((d12.C + d12.L + d12.B + d12.H + d12.LB)/5 -
(d24.C + d24.L + d24.B + d24.H + d24.LB)/5, levels=design.oneway)

Edit: Put a d in front of each group label, to get it to work properly.

Thanks for your explanation Aaron. I'm afraid I'm still a little confused. Please bear with me. When I say a "main effect" of diet, what I essentially want to know is which genes are DEed across all diet treatments, regardless of stress (main effect of stress would be DE across stress treatments regardless of diet). I tend to think of this like an ANOVA, when you want to see whether diet or stress have main effects and you can get a p-value for each, as well as a diet*stress interaction. The way I see it, which may be incorrect, is that you could essentially run an ANOVA (or more accurately a GLM) for each gene, with the dispersions being the dependent variable and ask whether expression is significantly different across diets, stressor, or diet-stress interactions. This would undoubtedly be tedious for tens of thousand of genes, but isn't that essentially what the edgeR glmfit/lrt function does? Please let me know if I'm completely off-base with this. However, if this is true wouldn't the following (which is from my initial question) give me the number of genes DE expressed across all diets, stressor, and interactions:

DIET/FULL MODEL:

> design.full<-model.matrix(~0+diet+stress+diet:stress)
> glm.dietfull<-glmFit(Hz_dgelist_filtered, design.full)
> lrt.dietfull<-glmLRT(glm.dietfull, coef=2:4)
> topTags(lrt.dietfull)
> FDR.dietfull<-p.adjust(lrt.dietfull$table$PValue, method="BH")
> sum(FDR.dietfull<0.05)
[1] 10719

STRESS/FULL MODEL:

> glm.stressfull<-glmFit(Hz_dgelist_filtered, design.full)
> lrt.stressfull<-glmLRT(glm.stressfull, coef=5:8)
> topTags(lrt.stressfull)
> FDR.stressfull<-p.adjust(lrt.stressfull$table$PValue, method="BH")
> sum(FDR.stressfull<0.05)
[1] 1043

INTERACTION/FULL MODEL:

> glm.intfull<-glmFit(Hz_dgelist_filtered, design.full)
> lrt.intfull<-glmLRT(glm.intfull, coef=9:20)
> topTags(lrt.intfull)
> FDR.intfull<-p.adjust(lrt.intfull$table$PValue, method="BH")
> sum(FDR.intfull<0.05)
[1] 587

After getting this general # of genes I will definitely want to make all the comparisons to narrow things down, so thanks for the contrasts code. One other question though, when I use:
con<-makeContrasts(12.C - 24.C, levels=design.oneway)

I get this error: Error: unexpected symbol in "con<-makeContrasts(12.C"

Any ideas?

thanks,

carrie

2

You're quite off-base. For starters, while a true ANOVA compares variances, this is not analogous to NB GLMs and NB dispersions. Instead, the LRT looks at the likelihood ratio, i.e., the difference between the total residual deviances of the null and full models. The same dispersion is used between the two models, so it's not really a dependent variable that changes according to your contrast. Anyway, regardless of whether you think of it in terms of an ANOVA or a LRT, the fact remains that you cannot interpret the p-values of your main effects if you have a significant interaction term. It's easy to get those p-values from various software (i.e., edgeR with a different model, or by using the basic anova function), but that doesn't ensure that the computed values have any meaning. See my original answer for some examples.

As for your contrasts, I can't be completely sure of what you've done, because I don't know what diet and stress are. However, I'm guessing you're just dropping the non-interaction terms associated with diet or stress (at least in the first two contrasts). This will not give you the main effect of diet or stress; the interaction terms will absorb DE in the null model, such that you will only be testing for DE in one stressor (when you drop the diet terms) or in one diet (when you drop the stress terms). This is a confusing issue at the best of times; check out this post (Different results in edgeR using simple vs GLM) and the associated answers for a discussion of this. In short, just because the coefficient is labeled by model.matrix as the effect of a particular diet/stressor doesn't mean that it actually represents that effect in the model.

Finally, the contrast code can't handle names that start with numbers; this has been corrected by slapping a d on the start.

Hi Aaron,

I am trying to do a similar analysis and I had a question about your last example of your response, in which you compare the expression between two diets, averaging over all stressors for each diet. Why do you divide by 2 here, and not 5, given that you are averaging the means of 5 groups?

Thanks,

Allie

Clearly, it's because I can't count.

Well spotted, I probably started off with two groups as an example, expanded the code to use all five groups, and forgot to change the denominator. This has now been fixed. Note that this doesn't affect the p-values, only the log-fold changes.