Question

edgeR effects of design on testing main effects and interactions

cadeans • 0

Hi all,

I am new to R scripting and bioinformatics in general. I'm using edgeR to run DE statistics on a project with 2 factors: diet (4 diets: 12, 24, 30, 39) and stress (5 stressors: C, L, H, B, LB). I have 4 replicates per treatment (80 samples in total). I would like to figure out which genes are DE for the main effects of diet and stress, and for the diet*stress interaction. I'm not sure how to run a full model at once that would give me the genes for the main effects and interactions (as is done in an ANOVA), so I've been looking at each separately by creating model matrices that specify just diet, just stress, an additive diet+stress model, and then a full model with all combinations. I noticed that when I run glmFit on diet, stress, diet+stress, and the interaction, the number of DE genes changes depending on which model matrix I reference (see below). There are few differences between the separate diet and stress models and the diet+stress model, but large differences between those and the full model.

Two questions:

1. Does anyone know why the number of DE genes changes depending on the design matrix?
2. Can anyone tell me the appropriate way to determine the main diet and stress effects and the interaction in one model (I know how to do individual contrasts but I'm looking for main and interaction effects first)?

Thanks in advance!

carrie

ADDITIVE MODEL:

> design.add<-model.matrix(~diet+stress)
> glm.stressadd<-glmFit(Hz_dgelist_filtered, design.add)
> design.add
> lrt.stressadd<-glmLRT(glm.stressadd, coef=5:8)
> topTags(lrt.stressadd)
> FDR.stressadd<-p.adjust(lrt.stressadd$table$PValue, method="BH")
> sum(FDR.stressadd<0.05)
[1] 5053

> glm.dietadd<-glmFit(Hz_dgelist_filtered, design.add)
> lrt.dietadd<-glmLRT(glm.dietadd, coef=2:4)
> topTags(lrt.dietadd)
> FDR.dietadd<-p.adjust(lrt.dietadd$table$PValue, method="BH")
> sum(FDR.dietadd<0.05)
[1] 2913

DIET MODEL:

> glm.diet<-glmFit(Hz_dgelist_filtered, design.diet)
> lrt.diet<-glmLRT(glm.diet, coef=2:4)
> topTags(lrt.diet)
> FDR.diet<-p.adjust(lrt.diet$table$PValue, method="BH")
> sum(FDR.diet<0.05)
[1] 3225

STRESS MODEL:

> design.stress<-model.matrix(~stress)
> glm.stress<-glmFit(Hz_dgelist_filtered, design.stress)
> lrt.stress<-glmLRT(glm.stress, coef=2:5)
> topTags(lrt.stress)
> FDR.stress<-p.adjust(lrt.stress$table$PValue, method="BH")
> sum(FDR.stress<0.05)
[1] 5371

DIET/FULL MODEL:

> design.full<-model.matrix(~0+diet+stress+diet:stress)
> glm.dietfull<-glmFit(Hz_dgelist_filtered, design.full)
> lrt.dietfull<-glmLRT(glm.dietfull, coef=2:4)
> topTags(lrt.dietfull)
> FDR.dietfull<-p.adjust(lrt.dietfull$table$PValue, method="BH")
> sum(FDR.dietfull<0.05)
[1] 10719

STRESS/FULL MODEL:

> glm.stressfull<-glmFit(Hz_dgelist_filtered, design.full)
> lrt.stressfull<-glmLRT(glm.stressfull, coef=5:8)
> topTags(lrt.stressfull)
> FDR.stressfull<-p.adjust(lrt.stressfull$table$PValue, method="BH")
> sum(FDR.stressfull<0.05)
[1] 1043

> glm.intfull<-glmFit(Hz_dgelist_filtered, design.full)
> lrt.intfull<-glmLRT(glm.intfull, coef=9:20)
> topTags(lrt.intfull)
> FDR.intfull<-p.adjust(lrt.intfull$table$PValue, method="BH")
> sum(FDR.intfull<0.05)
[1] 587

edger glmfit glmlrt() • 3.5k views

updated 8.5 years ago by Aaron Lun ★ 28k • written 8.5 years ago by cadeans • 0

score 5 · Answer 1 · 2015-11-12

The set of DE genes is expected to change with the design matrix, because each design fits a different model. When you use the full model, you're accounting for interactions between the diet and stress conditions. The additive model does not, so any real interactions in your data set will not be properly modelled. This typically has two consequences. First, the dispersion estimates are inflated, because non-additive responses deviate from the additive fitted values. Second, the log-fold changes computed for the "main effects" are distorted, and in any case they cannot be interpreted in the presence of significant interaction terms.
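There is also a more mechanical thing worth checking, namely what each coefficient in the full design actually represents. The sketch below rebuilds the full design from the question on a made-up sample sheet and prints the column names. The `samples` data frame is an assumption standing in for the real 80-sample layout, with factor levels in R's default alphabetical order. With `~0+diet+...`, the first four columns are per-diet abundance levels at the baseline stressor rather than differences between diets, so `coef=2:4` would test whether those levels are zero for diets 24, 30 and 39. That hypothesis would be rejected for almost every expressed gene, which would be consistent with the 10719 genes reported for the DIET/FULL model.

```r
# Hypothetical sample sheet: 4 diets x 5 stressors x 4 replicates = 80 samples.
samples <- expand.grid(rep = 1:4,
                       stress = c("C", "L", "H", "B", "LB"),
                       diet = c("12", "24", "30", "39"),
                       stringsAsFactors = FALSE)
diet <- factor(samples$diet)
stress <- factor(samples$stress)   # levels sort alphabetically, so B is the baseline

# Full model as written in the question (no intercept).
design.full <- model.matrix(~0 + diet + stress + diet:stress)
colnames(design.full)
# Columns 1:4 are diet12, diet24, diet30, diet39 (levels, not differences);
# columns 5:8 are the stress effects within diet 12; columns 9:20 are interactions.

# Same model with an intercept, for comparison.
design.int <- model.matrix(~diet * stress)
colnames(design.int)
# Columns 2:4 are now diet24, diet30, diet39 relative to diet 12,
# but only within the baseline stressor (B).
```

Even with the intercept, columns 2:4 describe the diet effect within the baseline stressor only, so dropping them is still not a test of a main effect in the ANOVA sense.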

Now, if you want to identify the main effect of stress or diet, you must ensure that the individual interaction terms are not significant. For example, say you want to identify the main effect of diet 12 against diet 24. This would not make any sense for a particular gene if, for example, that gene was upregulated in 12 vs. 24 in stress C, but downregulated in 12 vs. 24 in stress L (i.e., there is an interaction effect between 12 vs. 24 and L vs. C). Sure, you can compute an average log-fold change for 12 vs. 24 over all stressors, but that will be totally misleading if the 12 vs. 24 log-fold change in each stressor is not in a consistent direction. Even if the direction is the same, a large average log-fold change may be driven by one stressor, with near-zero values for all other stressors - this will be misleading as it will suggest a large overall effect where there is none.

To continue with this example: if you want to identify the diet main effect for 12 vs. 24, you need to identify the four interaction terms for 12 vs. 24 (one for each other stressor relative to whatever model.matrix has defined as the baseline, which is probably B as it comes first alphabetically), find the genes for which all four of those terms are not significant, and then test for a main effect of 12 vs. 24 by refitting a model without those interaction terms. This is quite a chore.

An easier way to test for DE between diets 12 and 24 is to perform tests individually within each stressor, i.e., test for 12 vs. 24 in stress L, repeat for stress C, etc. Then, you can identify genes that are significant in all stressors (or, less conservatively, in a minimum number of stressors), with the requirement that all significant log-fold changes have the same sign. This intersection will yield a set of genes that have DE in the same direction between diets 12 and 24 across all or most stressors.
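The intersection step can be sketched in base R as follows. This is only an illustration: `consistent_de` is a hypothetical helper, not an edgeR function, and it assumes you have already collected the per-stressor results into two genes-by-stressors matrices, one of BH-adjusted p-values (`fdr`) and one of log-fold changes (`logfc`).

```r
# Hypothetical helper: flag genes that are significant in at least
# `min_stressors` of the within-stressor tests, with every significant
# log-fold change pointing in the same direction.
consistent_de <- function(fdr, logfc, alpha = 0.05, min_stressors = ncol(fdr)) {
  sig <- fdr < alpha
  n_sig <- rowSums(sig)
  n_up <- rowSums(sig & logfc > 0)
  n_down <- rowSums(sig & logfc < 0)
  same_sign <- (n_up == n_sig) | (n_down == n_sig)
  n_sig >= min_stressors & same_sign
}

# Toy input: 4 genes, 3 stressors (made-up numbers).
fdr <- matrix(c(0.01, 0.02, 0.03,
                0.01, 0.02, 0.03,
                0.01, 0.50, 0.03,
                0.50, 0.60, 0.70),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("g", 1:4), c("C", "L", "H")))
logfc <- matrix(c(1.0,  2.0, 1.5,
                  1.0, -2.0, 1.5,
                  1.0, -0.1, 2.0,
                  0.2,  0.1, -0.3),
                nrow = 4, byrow = TRUE, dimnames = dimnames(fdr))

strict <- consistent_de(fdr, logfc)                      # all stressors
relaxed <- consistent_de(fdr, logfc, min_stressors = 2)  # at least two
```

In this toy input, only g1 passes the strict version; g3 is additionally kept by the relaxed version because its non-significant stressor is ignored when checking the sign.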

There are also other strategies to test for DE between 12 and 24 without having to formally test for a main effect, e.g., performing an ANOVA for 12 vs. 24 across all stressors and picking those significant genes with consistent signs of the log-fold change for all stressors, or comparing the averages of diet 12-associated coefficients with their diet 24 counterparts (though you'll need to identify non-significant interactions for this to make sense). In all cases, it's a lot easier to do this with a one-way layout rather than a factorial design:

grouping <- factor(paste0("d", diet, ".", stress))
design.oneway <- model.matrix(~0+grouping)
colnames(design.oneway) <- levels(grouping)

Testing for differences between diets 12 and 24 within a stressor (e.g., C) can be achieved by comparing the corresponding groups:

con <- makeContrasts(d12.C - d24.C, levels=design.oneway)

You can do this separately for each stressor, and intersect across some or all stressors as I've described in my first suggestion. Alternatively, you can do an ANOVA-like test across all stressors by combining the contrasts before testing:

con <- makeContrasts(d12.C - d24.C, d12.L - d24.L, d12.B - d24.B, 
        d12.H - d24.H, d12.LB - d24.LB, levels=design.oneway)

Or you can test for the average:

con <- makeContrasts((d12.C + d12.L + d12.B + d12.H + d12.LB)/5 -
          (d24.C + d24.L + d24.B + d24.H + d24.LB)/5, levels=design.oneway)
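For completeness, here is a minimal end-to-end sketch of the three approaches using the one-way layout. It runs on simulated counts with no true DE, purely to show the mechanics; the `samples` sheet, the `counts` matrix and the 0.05 threshold are assumptions, and you would substitute your own filtered DGEList. The contrasts are built from strings via the `contrasts=` argument of makeContrasts, which is equivalent to typing them out as above.

```r
library(edgeR)   # also loads limma, which provides makeContrasts()

set.seed(1)
# Simulated stand-in for the real data: 80 samples, 200 genes.
samples <- expand.grid(rep = 1:4,
                       stress = c("C", "L", "H", "B", "LB"),
                       diet = c("12", "24", "30", "39"),
                       stringsAsFactors = FALSE)
counts <- matrix(rnbinom(200 * nrow(samples), mu = 100, size = 10),
                 nrow = 200, dimnames = list(paste0("gene", 1:200), NULL))

# One-way layout: one group per diet/stress combination.
grouping <- factor(paste0("d", samples$diet, ".", samples$stress))
design.oneway <- model.matrix(~0 + grouping)
colnames(design.oneway) <- levels(grouping)

y <- DGEList(counts = counts, group = grouping)
y <- calcNormFactors(y)
y <- estimateDisp(y, design.oneway)
fit <- glmFit(y, design.oneway)

# One 12 vs. 24 contrast per stressor (20 groups x 5 contrasts).
stressors <- c("C", "L", "H", "B", "LB")
con.all <- makeContrasts(
  contrasts = paste0("d12.", stressors, " - d24.", stressors),
  levels = design.oneway)

# 1. Separate test within each stressor, then intersect.
res.within <- lapply(seq_along(stressors), function(i) {
  glmLRT(fit, contrast = con.all[, i])
})
names(res.within) <- stressors
fdr <- sapply(res.within, function(x) p.adjust(x$table$PValue, method = "BH"))
logfc <- sapply(res.within, function(x) x$table$logFC)
keep <- rowSums(fdr < 0.05) == ncol(fdr) &
  abs(rowSums(sign(logfc))) == ncol(logfc)   # same sign in every stressor

# 2. ANOVA-like test: any 12 vs. 24 difference in any stressor.
lrt.anova <- glmLRT(fit, contrast = con.all)

# 3. Average 12 vs. 24 difference across stressors.
con.avg <- rowMeans(con.all)   # +1/5 on each d12 group, -1/5 on each d24 group
lrt.avg <- glmLRT(fit, contrast = con.avg)
topTags(lrt.avg)
```

For the ANOVA-like test, the result table carries one log-fold change column per contrast, so the sign-consistency check from the first approach can be applied to those columns as well.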

Edit: Put a d in front of each group label, to get it to work properly.