Question

EdgeR: glmLRT(), Multifactorial designs and correction for multiple testing

0

Entering edit mode

Ekarl2 ▴ 80

@ekarl2-7202

Last seen 8.7 years ago

Sweden

My experimental design consists of four groups and two experimental factors (+/+, +/-, -/+ and -/-). In addition, each group has four biological replicates (the first biological replicate of each group was taken at the same time, the second replicates at another and so on) and a clear batch effect between replicates.

My design matrix looks like this:

design <- model.matrix(~batch+factor.1+factor.2)

             (Intercept) batch2 batch3 batch4 factor.1 factor.2
+/+.1           1      0      0      0               0      0
-/+.1           1      0      0      0               1      0
+/-.1           1      0      0      0               0      1
-/-.1           1      0      0      0               1      1
+/+.2           1      1      0      0               0      0
-/+.2           1      1      0      0               1      0
+/-.2           1      1      0      0               0      1
...
...
attr(,"assign")
[1] 0 1 1 1 2 3
attr(,"contrasts")
attr(,"contrasts")$batch
[1] "contr.treatment"

attr(,"contrasts")$factor.1
[1] "contr.treatment"

attr(,"contrasts")$factor.2
[1] "contr.treatment"

The factor.1 column represents presence/absence of factor 1 and factor.2 column represents presence/absence of factor 2.

So far, I have performed statistical significance testing between presence/absence of factor 1 and presence/absence of factor 2 in isolation:

lrt.factor1 <- glmLRT(fit, coef=5)
lrt.factor2 <- glmLRT(fit, coef=6)

However, this seems to mean that I am only correcting for x number of statistical significance tests (where x is the number of genes that survive filtering) in each test, but in total I am actually doing 2x tests spread over two separate applications of the glmLRT() function.

Is this really appropriate? It looks like the correction for multiple testing is artificially weak by this approach (since obviously the first glmLRT() doesn't "know" that I am running a second one after) and would thus produce more false positives / not be able to control FDR at the alleged level.

Is it more correct to do all the tests at once and correct for 2x number of tests instead?

lrt <- glmLRT(fit, coef=5:6)

Does the above command test to see if there are genes that are differentially expressed in any of the factor 1 (+/-) or factor 2 (+/-) comparisons and correct for 2x tests?

I have looked in the edgeR user manual and this kind of logic seems to be applied in the Arabidopsis case study when testing for differential expression between any of the batches (b1 vs b2, b2 vs. b3, b1 vs. b3) to see if there is a batch effect and then they use glmLRT(fit, coef=2:3), but I have only worked with R for a couple of weeks so a bit plagued by self-doubt...

edgeR glmLRT() multiple factor design multiple testing correction • 7.6k views

ADD COMMENT • link updated 10.1 years ago by Aaron Lun ★ 28k • written 10.1 years ago by Ekarl2 ▴ 80

score 3 · Accepted Answer · 2015-01-14

3

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 9 hours ago

The city by the bay

Firstly, the Benjamini-Hochberg correction is applied in topTags, not glmLRT. Secondly, the FDR should be controlled under the nominal threshold in each DE list, when you run topTags on each object returned by glmLRT. There's no loss of control here, no matter how many DE lists you have. For example, if I get 100 DE genes for coef=5 and 500 DE genes for coef=6 at a FDR of 5% in each list, I'd expect no more than 5 + 25 = 30 false positives out of a total of 600 discoveries (any common genes between the two lists are considered as distinct discoveries). This gives an overall FDR of 5%, so no liberalness is observed.

Now, your coef=5:6 approach addresses a different question. You are correct in saying that it looks for genes that change between any of the experimental factors. However, this is done with a single test for each gene. Thus, the number of tests is still equal to the number of genes. Which approach you should use depends on what question you're trying to answer. If you want to find DE genes that change between states for factor 1, then it makes sense to set coef=5 (and similarly, coef=6 for factor 2). If you want to find genes that change between states for any factor, then your coef=5:6 approach is better.

Finally, your design matrix doesn't actually reflect a four-group setup. As it is now, it assumes that the log-fold change is additive between the two experimental factors. This may not be appropriate, e.g., for cases involving epistasis or synergy. In such cases, a better formulation might be:

group <- paste0(factor.1, ".", factor.2)
design <- model.matrix(~batch + group)

ADD COMMENT • link 10.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you for a very illuminating answer.

It seems that I am using two topTags, one for each experimental factor, so it seems that the issue is there (I just misunderstood where it was in the edgeR procedure). I think I will use coef=5:6 and use a single topTags() to avoid this. How does that sound?

It seems that one of my experimental factors (factor.2) has no effect whatsoever (zero genes are differentially expressed at 0.05 FDR) when tested on its own. Is it still useful to check for interaction since that is the way the experimental design is set up? In other words, should I include an interaction term in my design matrix because that is the way the experimental design was done?

If so, my design matrix would look like this:

design <- model.matrix(~batch+factor.1+factor.2+factor.1*factor.2)

Does that make sense? Can I do a single glmLRT(coef=5:6:7) for all of them (factor 1, 2 and interaction) or are those done separately glmLRT(coef=5:6), glmLRT(coef=7)? If so, how can topTags() correctly manage the number of tests? Or was that no problem because of your explanation that "FDR should be controlled under the nominal threshold in each DE list.."?

ADD REPLY • link 10.1 years ago Ekarl2 ▴ 80

1

Entering edit mode

With regards to factor 2; it's important to stress that failure to detect any DE is not the same as no effect. For example, your original additive model might have been insufficient to describe the experimental set-up. This could lead to inflated dispersions and loss of power to detect genuine DE for factor 2. Now, you've got plenty of residual d.f. to play with, so you might as well specify the full model and see what you end up with. The approach I suggested with group is simpler to interpret, but if you insist on an interaction model, you should do:

design <- model.matrix(~batch+factor.1*factor.2)

If you run glmLRT with coef=5:7, your null hypothesis is that there's no difference between any of the four groups. If you run coef=5, your null is that there's no difference between +/+ and -/+. If you run coef=6, your null is that there's no difference between +/+ and +/-. If you run coef=7, your null is that there is no interaction, i.e., effects are additive. Which one you should use depends on the question you want to answer, rather than any consideration of FDR control.

ADD REPLY • link 10.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

The question that I would like to answer is either:

(1) what genes are differentially regulated by any of factor 1 and factor 2.

(2) what genes are differentially regulated by any of factor 1, factor 2 or interaction between factor 1 and factor 2.

If I understood your answer correctly, I should use coef=5:6 for (1) and coef:5:7 for (2). Is that correct?

ADD REPLY • link 10.1 years ago Ekarl2 ▴ 80

1

Entering edit mode

That sounds right. A more intiuitive interpretation of aim 1 is that you're looking for genes that are DE upon changing either of the factors individually. In contrast, aim 2 looks for DE genes between any of the four groups. So, any changes in expression in the double negative group will not cause rejections for aim 1 (as this is absorbed by the interaction term), but will be picked up by aim 2.

ADD REPLY • link 10.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

And doing coef=5 and coef=6 in two different glmLRT / topTags would be looking for those that differ for factor 1 specifically and those that differ for factor 2 specificaly?

ADD REPLY • link 10.1 years ago Ekarl2 ▴ 80

1

Entering edit mode

Yeah. If you intepret it from group-based perspective, the DE genes for coef=5 are those that differ between +/+ and -/+ (i.e., after changing the first factor), whereas the DE genes for coef=6 are those that differ between +/+ and +/- (i.e., after changing the second factor).

ADD REPLY • link 10.1 years ago Aaron Lun ★ 28k