edgeR for a mixture experiment
1
0
Entering edit mode
ingerslev • 0
@ingerslev-19177
Last seen 12 weeks ago
Denmark

I am analysing an RRBS experiment where my conditions are a mixture of three compounds. These compounds sum to one, so A + B + C = 1, and my conditions are various percentages of A, B and C.

The statisticians who usually analyze clinical data using these models have been recommended to use models from the mixexp package, specifically the linear and quadratic models.

This means I want to compare three models:

model_null <- ~ 1
model_linear <- ~ 0 + A + B + C
model_quadratic <- ~ 0 + A + B + C + A:B + A:C + B:C


The problem is that testing the coefficients, as recommended multiple times, e.g. here, here and here is not really possible.

I suppose that for comparing the quadratic to the linear model, I can really just look at whether the interaction terms are significant, but for comparing the linear to the null model is not possible by testing whether A, B and C are significant since there is no intercept.

As I understand it, this model better handles to covariance between the three compounds than a regular ~A + B model would.

My goal is to select sites where the mixture affects the methylation level, and then afterwards work with the estimated coefficients, so I don't need to be able to test the significance of individual compounds. I really only want to detect if the site is affected at all.

Is there some way to select significant sites using such a model? Maybe fitting a model of the form ~A + B, selecting sites that are significant and then estimating the coefficients using a ~0 + A + B + C model?

I realize that there might not be a best model for all sites, but I was hoping to select one model that fit most sites and use that for all sites.

edgeR model-selection RRBS mixture-experiment • 247 views
3
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

To compare the linear model to the intercept, you can simply fit ~ A + B and test the A and B coefficients equal to zero. The works because C is a function of A and B (C = 1 - A - B) so adding C to the model as well as 1 is redundant. You could do the same test by fitting ~B+C or ~A+C. I believe all three fits will give the same test results and p-values.

0
Entering edit mode

Thank you for the quick reply. I took the naming from the mixexp publication, but never mind that.

The three models cannot be exactly the same as evidenced in the code below:

exp_setup <- structure(list(Mixture = LETTERS[1:10],
A = c(7L, 7L, 14L, 14L, 21L, 21L, 30L, 35L, 42L, 50L),
B = c(33L, 78L, 56L, 26L, 64L, 34L, 40L, 20L, 43L, 20L),
C = c(60L, 15L, 30L, 60L, 15L, 45L, 30L, 45L, 15L, 30L)),
row.names = c(NA, -10L), class = "data.frame")
mod1 <- model.matrix(~ A + B + C, data = exp_setup)
mod2 <- model.matrix(~ 1 + A + B + C, data = exp_setup)
mod3 <- model.matrix(~ 0 + A + B + C, data = exp_setup)

is.fullrank(mod1)
is.fullrank(mod2)
is.fullrank(mod3)


The first two are identical, but only the last model is of full rank.

1
Entering edit mode

My apologies. I had actually assumed from your use of : that A, B and C were defined as factors. If A, B and C had been factors then the three models would be the same.

Even in your case, the three models are the same once the non-estimable columns are removed from mod1 and mod2. That would be done automatically by limma but isn't done automatically by edgeR.

I'm going to edit my answer to give different advice that doesn't involve a non-full rank matrix.

I have also consulted Scheffe (1958) on mixture models and I see now that your model_quadratic is indeed a quadratic model when A, B and C add to unity. The linearly dependency between A, B and C makes the extra terms that usually appear in a quadratic formula redundant.

0
Entering edit mode

Thank you for the update, it helped to clarify it for me. Would the following procedure then be how to select an appropriate model:

1. Fit the model mod1 <- ~ 1 + B + C + A:B + A:C + B:C and test glmLRT(coef = c("A:B", "A:C", "B:C")).
2. If the test for interaction terms are significant for any site, fit the model mod2 <- ~ 0 + A + B + C + A:B + A:C + B:C and extract the coefficients where glmLRT(coef = c("B", "C", "A:B", "A:C", "B:C")) in mod1 are significant.
3. If no interaction terms are significant in mod1, fit the models mod3 <- ~A + B and mod4 <- ~ 0 + A + B + C and extract the coefficients in mod4 where glmLRT(coef = c("B", "C")) is significant in mod3.
0
Entering edit mode

I just realized that I can fit the model ~ 0 + A + B + C + A:B + A:C + B:C and if the columns A:B, A:C and B:C are significant, use the estimated coefficients directly. If they are not I can fit the ~ A + B and ~ 0 + A + B + C models and, since they are identical, use ~ A + B to select significant sites. I am marking this answer as accepted. Thank you for your time.