Question

Using limma when continuous and categorical confounders are present at the same time (e.g.age, gender)

0

Entering edit mode

sg45653 • 0

@sg45653-8423

Last seen 9.1 years ago

United States

Hello,

I have an Affymetrix microarray dataset where the goal of analysis is to identify genes significantly associated with a continous response (bvo2mxkg). Age (numeric), sex (categorical) and body mass index (numeric) are "confounders", for the purposes of analysis. I am wondering if the following design specification is correct (the expression and variable data are stored in an eSet object named exampleSet:

design <- model.matrix(~bvo2mxkg + age + as.factor(sex) + BMI, pData(exampleSet))

I am unclear on how to tell limma which is the dependent variable (bvo2mxkg) and which are covariates (age, BMI) or confoundes (sex).

Thanks in adance

limma covariate continuous categorical adjustment • 7.1k views

ADD COMMENT • link updated 9.7 years ago by Gordon Smyth 52k • written 9.7 years ago by sg45653 • 0

score 8 · Answer 1 · 2015-07-18

The design matrix doesn't distinguish between dependent variables and confounding variables. Instead, you identify the relevant variables when you do the DE comparison, by specifying the coefficient to drop in topTable or by constructing a contrast with the makeContrasts function. In such cases, you drop the coefficient corresponding to your dependent variable, or you construct a contrast involving the dependent variables, and just ignore the confounding variables in the design.

As for your actual specification; for continuous variables, I would suggest using splines:

require(splines)
X <- ns(bvo2mxkg, df=5)
design <- model.matrix(~X + age + as.factor(sex) + BMI, pData(exampleSet))

This allows for non-linear responses in the expression with respect to changes in the variable. The same logic applies for the other continuous variables like age and BMI. However, if you do this, make sure you set the spline df such that you have enough residual d.f. for variance estimation. Also, all of the spline coefficients are now dependent variables, so you need to drop them all at once in the DE comparison to test for a bvo2mxkg effect, e.g., coef=2:6 in the topTable call.

Finally, you should keep in mind that your model assumes an additive confounding effect. For example, sex is assumed to affect the absolute magnitude of expression differentially between sexes, but not the nature of the trend w.r.t. bvo2mxkg. If you have a gene where expression goes up in males but down in females with increasing bvo2mxkg, this will not be properly modelled by your design. The same limitation applies with the other confounders. A larger model with interaction terms might help, but this would have its own problems (e.g., loss of residual d.f., difficulty in defining interactions of two or more splines) so I would just stick with what you've already got.

score 5 · Answer 2 · 2015-07-19

Just to be clear about terminology, the "dependent" variable here is actually the vector of normalized log2-intensities for each probe-set. bvo2mxkg is your "predictor" or "independent" variable.

To find genes associated with bv02mxkg, you simply use topTable(fit, coef="bvo2mxkg") later on in the limma pipeline.

This tests for linear correlation between bvo02mxkg and log-expression (adjusted for age, sex and BMI). To test for more general nonlinear relationships, the regression spline approach suggested by Aaron is a good one.