Hi: I'm trying to identify differentially methylated CpGs between two different treatments. I've been following the edgeR vignette as well as the Chen et al. 2018 paper but am running into confusing results depending on how I code the model matrix.
If I code the model matrix with an intercept (as suggested) it returns a result that nearly all sites are differentially methylated.
> design <- model.matrix(~ 0 + treatment2, data = gamopredict) %>% modelMatrixMeth()
> y <- estimateDisp(y, design, trend="none")
> contr <- makeContrasts(treat=treatment2P-treatment2W, levels=design)
> fit <- glmFit(y, design, contrast=contr)
> lrt <- glmLRT(fit)
> topTags(lrt)
Coefficient: treatment2W
Locus logFC logCPM LR PValue FDR
Chr12 18880978 -11.033343 10.521151 692.0369 1.611965e-152 9.704029e-150
Chr12 18880981 -10.990813 10.518484 686.1295 3.104467e-151 9.344445e-149
Chr12 18880973 -10.966224 10.521567 683.4430 1.191774e-150 2.391493e-148
Chr12 18881008 -10.071527 10.510221 676.7843 3.343775e-149 5.032381e-147
Chr12 18881002 -10.792768 10.519862 607.2119 4.520285e-134 5.442424e-132
Chr1A 87303489 -9.098479 9.649492 558.2882 1.981451e-123 1.988056e-121
Chr4 27136393 -9.856523 12.609553 554.7452 1.168751e-122 1.005126e-120
I get the same result whether or not I use makeContrasts.
However, if I code the matrix without the intercept, nothing is differentially methylated.
> design <- model.matrix(~ treatment2, data = gamopredict) %>% modelMatrixMeth()
> y <- estimateDisp(y, design, trend="none")
> fit <- glmFit(y, design, coef=2)
> lrt <- glmLRT(fit)
> topTags(lrt)
Coefficient: treatment2W
Locus logFC logCPM LR PValue FDR
Chr23 84 2.1879144 8.713076 9.245310 0.002360993 1
Chr1A 87303513 2.4901308 7.971224 7.326217 0.006795592 1
Chr1A 87303566 -1.9760789 7.973802 6.960347 0.008333586 1
Chr21 11734586 0.6170862 10.742564 6.046373 0.013934896 1
Chr3 3807533 -2.5787405 7.869589 5.628683 0.017668904 1
I don't really believe the results from the first model but I can't figure out why the results are so different. If anyone has any suggestions or thoughts I'd really appreciate it.
Based on your snippets, you are actually doing it the other way around. Using
~ 0 +
in your model matrix will actually give you the non-intercept design.Thanks Mikhael: you're right, the first design is without intercept and the second is with intercept. Do you have any idea why the models give me such different results?
Thanks James and Mikhael for pointing me in the right direction. 1) I did have a significant error in my code: the "contrast" and "coef" arguments should have been in the glmLRT() function, not the glmFit() function. 2) I also went and downloaded the bisulfite sequencing of mouse oocyte data from section 4.7 of the vignette to investigate a more reproducible example. In this example, the "Group" is the treatment, which has three levels (40um, 50um, and 60um) and there are two replicates of each treatment for a total of six samples.
I had, in fact, looked at the design matrix; however, it turns out that my error was in not understanding the "coef" argument in the glmLRT() function meant. I thought that the "coef" argument specified the level of the treatment I wanted to compare (i.e. coef = 2 would compare the second level to the first level). Because the examples in the vignette for this argument didn't use the expanded modelMatrixMeth, I didn't realize I needed to specify the column number instead. After looking more closely I realized my error.
To sum up: here are two equivalent ways of coding a glm to compare two treatments in a bisulfite experiment. (data from section 4.7 of the vignette, following filtering).
1) No intercept version, using the "contrast=" argument
2) Intercept version, using the "coef=" argument
Hopefully this helps anyone who runs into the same confusion. (edited to remove outputs which were not formatting right in the code sample)
In the second version, you don't even need to count coefficients. Instead of
coef=9
you can just usecoef="Group60um"
. Then you know it is the right coefficient.