Search
Question: Using limma when continuous and categorical confounders are present at the same time (e.g.age, gender)
0
gravatar for sg45653
2.8 years ago by
sg456530
United States
sg456530 wrote:

Hello, 

I have an Affymetrix microarray dataset where the goal of analysis is to identify genes significantly associated with a continous response (bvo2mxkg). Age (numeric), sex (categorical) and body mass index (numeric) are "confounders", for the purposes of analysis. I am wondering if the following design specification is correct (the expression and variable data are stored in an eSet object named exampleSet:

design <- model.matrix(~bvo2mxkg + age + as.factor(sex) + BMI, pData(exampleSet))

I am unclear on how to tell limma which is the dependent variable (bvo2mxkg) and which are covariates (age, BMI) or confoundes (sex).

Thanks in adance

ADD COMMENTlink modified 2.8 years ago by Gordon Smyth33k • written 2.8 years ago by sg456530
8
gravatar for Aaron Lun
2.8 years ago by
Aaron Lun19k
Cambridge, United Kingdom
Aaron Lun19k wrote:

The design matrix doesn't distinguish between dependent variables and confounding variables. Instead, you identify the relevant variables when you do the DE comparison, by specifying the coefficient to drop in topTable or by constructing a contrast with the makeContrasts function. In such cases, you drop the coefficient corresponding to your dependent variable, or you construct a contrast involving the dependent variables, and just ignore the confounding variables in the design.

As for your actual specification; for continuous variables, I would suggest using splines:

require(splines)
X <- ns(bvo2mxkg, df=5)
design <- model.matrix(~X + age + as.factor(sex) + BMI, pData(exampleSet))

This allows for non-linear responses in the expression with respect to changes in the variable. The same logic applies for the other continuous variables like age and BMI. However, if you do this, make sure you set the spline df such that you have enough residual d.f. for variance estimation. Also, all of the spline coefficients are now dependent variables, so you need to drop them all at once in the DE comparison to test for a bvo2mxkg effect, e.g., coef=2:6 in the topTable call.

Finally, you should keep in mind that your model assumes an additive confounding effect. For example, sex is assumed to affect the absolute magnitude of expression differentially between sexes, but not the nature of the trend w.r.t. bvo2mxkg. If you have a gene where expression goes up in males but down in females with increasing bvo2mxkg, this will not be properly modelled by your design. The same limitation applies with the other confounders. A larger model with interaction terms might help, but this would have its own problems (e.g., loss of residual d.f., difficulty in defining interactions of two or more splines) so I would just stick with what you've already got.

ADD COMMENTlink written 2.8 years ago by Aaron Lun19k
4
gravatar for Gordon Smyth
2.8 years ago by
Gordon Smyth33k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth33k wrote:

Just to be clear about terminology, the "dependent" variable here is actually the vector of normalized log2-intensities for each probe-set. bvo2mxkg is your "predictor" or "independent" variable.

To find genes associated with bv02mxkg, you simply use topTable(fit, coef="bvo2mxkg") later on in the limma pipeline.

This tests for linear correlation between bvo02mxkg and log-expression (adjusted for age, sex and BMI). To test for more general nonlinear relationships, the regression spline approach suggested by Aaron is a good one.

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Gordon Smyth33k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 421 users visited in the last hour