Using limma when continuous and categorical confounders are present at the same time (e.g.age, gender)
2
0
Entering edit mode
sg45653 • 0
@sg45653-8423
Last seen 8.1 years ago
United States

Hello, 

I have an Affymetrix microarray dataset where the goal of analysis is to identify genes significantly associated with a continous response (bvo2mxkg). Age (numeric), sex (categorical) and body mass index (numeric) are "confounders", for the purposes of analysis. I am wondering if the following design specification is correct (the expression and variable data are stored in an eSet object named exampleSet:

design <- model.matrix(~bvo2mxkg + age + as.factor(sex) + BMI, pData(exampleSet))

I am unclear on how to tell limma which is the dependent variable (bvo2mxkg) and which are covariates (age, BMI) or confoundes (sex).

Thanks in adance

limma covariate continuous categorical adjustment • 6.4k views
ADD COMMENT
8
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 3 hours ago
The city by the bay

The design matrix doesn't distinguish between dependent variables and confounding variables. Instead, you identify the relevant variables when you do the DE comparison, by specifying the coefficient to drop in topTable or by constructing a contrast with the makeContrasts function. In such cases, you drop the coefficient corresponding to your dependent variable, or you construct a contrast involving the dependent variables, and just ignore the confounding variables in the design.

As for your actual specification; for continuous variables, I would suggest using splines:

require(splines)
X <- ns(bvo2mxkg, df=5)
design <- model.matrix(~X + age + as.factor(sex) + BMI, pData(exampleSet))

This allows for non-linear responses in the expression with respect to changes in the variable. The same logic applies for the other continuous variables like age and BMI. However, if you do this, make sure you set the spline df such that you have enough residual d.f. for variance estimation. Also, all of the spline coefficients are now dependent variables, so you need to drop them all at once in the DE comparison to test for a bvo2mxkg effect, e.g., coef=2:6 in the topTable call.

Finally, you should keep in mind that your model assumes an additive confounding effect. For example, sex is assumed to affect the absolute magnitude of expression differentially between sexes, but not the nature of the trend w.r.t. bvo2mxkg. If you have a gene where expression goes up in males but down in females with increasing bvo2mxkg, this will not be properly modelled by your design. The same limitation applies with the other confounders. A larger model with interaction terms might help, but this would have its own problems (e.g., loss of residual d.f., difficulty in defining interactions of two or more splines) so I would just stick with what you've already got.

ADD COMMENT
5
Entering edit mode
@gordon-smyth
Last seen 15 minutes ago
WEHI, Melbourne, Australia

Just to be clear about terminology, the "dependent" variable here is actually the vector of normalized log2-intensities for each probe-set. bvo2mxkg is your "predictor" or "independent" variable.

To find genes associated with bv02mxkg, you simply use topTable(fit, coef="bvo2mxkg") later on in the limma pipeline.

This tests for linear correlation between bvo02mxkg and log-expression (adjusted for age, sex and BMI). To test for more general nonlinear relationships, the regression spline approach suggested by Aaron is a good one.

ADD COMMENT

Login before adding your answer.

Traffic: 888 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6