Question

Use intercept or not in model matrix for edgeR methylation analysis

0

Entering edit mode

sabrina.mcnew • 0

@sabrinamcnew-21719

Last seen 4.7 years ago

Hi: I'm trying to identify differentially methylated CpGs between two different treatments. I've been following the edgeR vignette as well as the Chen et al. 2018 paper but am running into confusing results depending on how I code the model matrix.

If I code the model matrix with an intercept (as suggested) it returns a result that nearly all sites are differentially methylated.

> design <- model.matrix(~ 0 + treatment2, data = gamopredict) %>% modelMatrixMeth()
> y <- estimateDisp(y, design, trend="none") 
> contr <- makeContrasts(treat=treatment2P-treatment2W, levels=design)
> fit <- glmFit(y, design, contrast=contr)
> lrt <- glmLRT(fit)
> topTags(lrt)

    Coefficient:  treatment2W 
 Locus      logFC    logCPM       LR        PValue           FDR
Chr12 18880978 -11.033343 10.521151 692.0369 1.611965e-152 9.704029e-150
Chr12 18880981 -10.990813 10.518484 686.1295 3.104467e-151 9.344445e-149
Chr12 18880973 -10.966224 10.521567 683.4430 1.191774e-150 2.391493e-148
Chr12 18881008 -10.071527 10.510221 676.7843 3.343775e-149 5.032381e-147
Chr12 18881002 -10.792768 10.519862 607.2119 4.520285e-134 5.442424e-132
Chr1A 87303489  -9.098479  9.649492 558.2882 1.981451e-123 1.988056e-121
Chr4 27136393  -9.856523 12.609553 554.7452 1.168751e-122 1.005126e-120

I get the same result whether or not I use makeContrasts.

However, if I code the matrix without the intercept, nothing is differentially methylated.

> design <- model.matrix(~ treatment2, data = gamopredict) %>% modelMatrixMeth()
> y <- estimateDisp(y, design, trend="none") 
> fit <- glmFit(y, design, coef=2)
> lrt <- glmLRT(fit)
> topTags(lrt)

Coefficient:  treatment2W 
Locus      logFC    logCPM       LR      PValue FDR
Chr23       84  2.1879144  8.713076 9.245310 0.002360993   1
Chr1A 87303513  2.4901308  7.971224 7.326217 0.006795592   1
Chr1A 87303566 -1.9760789  7.973802 6.960347 0.008333586   1
Chr21 11734586  0.6170862 10.742564 6.046373 0.013934896   1
Chr3  3807533 -2.5787405  7.869589 5.628683 0.017668904   1

I don't really believe the results from the first model but I can't figure out why the results are so different. If anyone has any suggestions or thoughts I'd really appreciate it.

edgeR R methylation GLM • 1.6k views

ADD COMMENT • link 4.7 years ago sabrina.mcnew • 0

0

Entering edit mode

Based on your snippets, you are actually doing it the other way around. Using ~ 0 + in your model matrix will actually give you the non-intercept design.

ADD REPLY • link 4.7 years ago mikhael.manurung ▴ 270

0

Entering edit mode

Thanks Mikhael: you're right, the first design is without intercept and the second is with intercept. Do you have any idea why the models give me such different results?

ADD REPLY • link 4.7 years ago sabrina.mcnew • 0

0

Entering edit mode

Thanks James and Mikhael for pointing me in the right direction. 1) I did have a significant error in my code: the "contrast" and "coef" arguments should have been in the glmLRT() function, not the glmFit() function. 2) I also went and downloaded the bisulfite sequencing of mouse oocyte data from section 4.7 of the vignette to investigate a more reproducible example. In this example, the "Group" is the treatment, which has three levels (40um, 50um, and 60um) and there are two replicates of each treatment for a total of six samples.

I had, in fact, looked at the design matrix; however, it turns out that my error was in not understanding the "coef" argument in the glmLRT() function meant. I thought that the "coef" argument specified the level of the treatment I wanted to compare (i.e. coef = 2 would compare the second level to the first level). Because the examples in the vignette for this argument didn't use the expanded modelMatrixMeth, I didn't realize I needed to specify the column number instead. After looking more closely I realized my error.

To sum up: here are two equivalent ways of coding a glm to compare two treatments in a bisulfite experiment. (data from section 4.7 of the vignette, following filtering).

1) No intercept version, using the "contrast=" argument

> design <- model.matrix(~0+Group, data=targets) %>% modelMatrixMeth()
> fit <- glmFit(y1, design)
> contr <- makeContrasts(Group60vs40 = Group60um - Group40um, levels=design) #specify comparison between 2 groups
> lrt <- glmLRT(fit, contrast=contr)

2) Intercept version, using the "coef=" argument

> designb <-  model.matrix(~Group, data=targets) %>% modelMatrixMeth()
> y1.b <- estimateDisp(y1, design = designb, trend="none") 
> fit.b <- glmFit(y1.b, designb)
> lrt.b <- glmLRT(fit.b, coef = 9) #to compare 60 to 40 (the intercept) we choose the column number from the design matrix

Hopefully this helps anyone who runs into the same confusion. (edited to remove outputs which were not formatting right in the code sample)

ADD REPLY • link updated 4.7 years ago by Gordon Smyth 50k • written 4.7 years ago by sabrina.mcnew • 0

0

Entering edit mode

In the second version, you don't even need to count coefficients. Instead of coef=9 you can just use coef="Group60um". Then you know it is the right coefficient.

ADD REPLY • link 4.7 years ago Gordon Smyth 50k

score 1 · Answer 1 · 2019-08-26

As Mikhael noted, your code is mixed up, so it's not clear exactly what you have done. However, it is likely that you haven't really looked at (or tried to interpret) your design matrix. What modelMatrixMeth does is to generate a design matrix that contains subject-level coefficients, followed by the coefficients of interest. As an example, using the example from modelMatrixMeth:

mdlMtM> modelMatrixMeth(design.treatments)
   Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 (Intercept) TreatmentsB
1        1       0       0       0       0       0           1           0
2        1       0       0       0       0       0           0           0
3        0       1       0       0       0       0           1           0
4        0       1       0       0       0       0           0           0
5        0       0       1       0       0       0           1           1
6        0       0       1       0       0       0           0           0
7        0       0       0       1       0       0           1           1
8        0       0       0       1       0       0           0           0
9        0       0       0       0       1       0           1           0
10       0       0       0       0       1       0           0           0
11       0       0       0       0       0       1           1           0
12       0       0       0       0       0       1           0           0
   TreatmentsC
1            0
2            0
3            0
4            0
5            0
6            0
7            0
8            0
9            1
10           0
11           1
12           0

None of the sample-level coefficients are of interest! These are just intended to control for the average count, at each CpG, for each subject. And if you use coef = 2, you are almost surely asking for a test of a sample-level coefficient, which will be significant for almost all CpGs. You need to inspect your design matrix to identify the coefficients of interest, and then use topTags.