The design matrix doesn't distinguish between dependent variables and confounding variables. Instead, you identify the relevant variables when you do the DE comparison, by specifying the coefficient to drop in topTable
or by constructing a contrast with the makeContrasts
function. In such cases, you drop the coefficient corresponding to your dependent variable, or you construct a contrast involving the dependent variables, and just ignore the confounding variables in the design.
As for your actual specification; for continuous variables, I would suggest using splines:
require(splines)
X <- ns(bvo2mxkg, df=5)
design <- model.matrix(~X + age + as.factor(sex) + BMI, pData(exampleSet))
This allows for non-linear responses in the expression with respect to changes in the variable. The same logic applies for the other continuous variables like age and BMI. However, if you do this, make sure you set the spline df
such that you have enough residual d.f. for variance estimation. Also, all of the spline coefficients are now dependent variables, so you need to drop them all at once in the DE comparison to test for a bvo2mxkg
effect, e.g., coef=2:6
in the topTable
call.
Finally, you should keep in mind that your model assumes an additive confounding effect. For example, sex
is assumed to affect the absolute magnitude of expression differentially between sexes, but not the nature of the trend w.r.t. bvo2mxkg
. If you have a gene where expression goes up in males but down in females with increasing bvo2mxkg
, this will not be properly modelled by your design. The same limitation applies with the other confounders. A larger model with interaction terms might help, but this would have its own problems (e.g., loss of residual d.f., difficulty in defining interactions of two or more splines) so I would just stick with what you've already got.