When I run limma compare disease vs control, the design object has two columns. However, when I adjust for covariates: sex (factor with 2 levels 0 and 1) and age, I have 4 columns: vector_age, vector_sex1, vector_sex2, vector_groupdisease

contrast_matrix <- makeContrasts(diseasevsNC=group_vectordisease-group_vectorNC, levels=design)
Error in eval(ej, envir = levelsenv) : object 'group_vectorCN' not found

Is that limma automatically set NC as reference so I can run:

When you add the other coefficients to your model, you cannot have separate disease/control coefficients (unless you reorder and put disease as your first coefficient). Instead, the group_vectordisease coefficient is estimating the difference between disease and control, rather than the mean of each group, so you don't need an explicit contrast. Just select that coefficient when you run topTable.

See section 9.2 in the limma User's Guide for an explanation of treatment contrasts versus cell means models.

Thanks James! May I know why the sex covariate has two columns? Meanwhile, the groups also has 2 levels as sex but only has one column in the design matrix?

There's a couple of things to unpack here. By default, if you specify no intercept (~ 0 + ...), then R will try to generate a cell means model, where you compute the mean of each group. But if the resulting design matrix won't be full rank, it will automatically drop coefficients. That's one point, but is primarily of interest for ANOVA models where all your coefficients are factors.

Normally, if you have a continuous variable in your model (you are fitting an ANCOVA model), you want an intercept (otherwise you are forcing the regression line for Age to go through zero, which is probably not what you want, because there is no reason to think that any gene expression is zero at birth). And if you include an intercept, given that the default for R is a treatments contrast parameterization, that means you will be setting one group to be the baseline. If you are fitting an ANOVA, the difference between having an intercept or not is only relevant for coefficient interpretation. But if you are fitting an ANCOVA model, then you should include an intercept.

If this is confusing, you need to do some reading about linear regression in general, and how R does it in particular. Or if that isn't of particular interest, you should find someone local who understands linear regression to help you. Just doing stuff without understanding what you are doing is a recipe for disaster.

Anyway, as an example,

## first some fake data
> d.f <- data.frame(Age = runif(10)*20, Sex = factor(rep(c("M","F"), each = 5)), Disease = rep(rep(c("Disease","Control"), 2), c(2,3,3,2)))
> d.f
Age Sex Disease
1 17.401343 M Disease
2 3.155670 M Disease
3 11.738212 M Control
4 8.422803 M Control
5 16.028492 M Control
6 6.371306 F Disease
7 9.778948 F Disease
8 2.615691 F Disease
9 3.860053 F Control
10 4.631244 F Control
## now some design matrices
> model.matrix(~0 + Age+Sex+Disease, d.f)
Age SexF SexM DiseaseDisease
1 2.940991 0 1 1
2 17.391425 0 1 1
3 16.839601 0 1 0
4 7.066056 0 1 0
5 15.662131 0 1 0
6 11.842755 1 0 1
7 12.147740 1 0 1
8 13.192485 1 0 1
9 1.671015 1 0 0
10 19.328127 1 0 0
## the above design matrix is suboptimal because you are forcing the intercept to be zero
> model.matrix(~Age+Sex+Disease, d.f)
(Intercept) Age SexM DiseaseDisease
1 1 19.458611 1 1
2 1 11.601831 1 1
3 1 8.436780 1 0
4 1 4.147289 1 0
5 1 5.182194 1 0
6 1 18.053081 0 1
7 1 18.313368 0 1
8 1 19.394239 0 1
9 1 1.389276 0 0
10 1 8.681382 0 0
## this one is better

In the second parameterization, the intercept computes the mean of the female control subjects (which we know because all other coefficients in rows 9-10 are zero, except for age), and the SexM coefficient estimates the difference between control males and control females, and the DiseaseDisease estimates the difference between disease females and control females. But since you have already 'adjusted out' the difference between males and females, you use all the data to estimate disease vs control.

Thanks James! May I know why the sex covariate has two columns? Meanwhile, the groups also has 2 levels as sex but only has one column in the design matrix?

There's a couple of things to unpack here. By default, if you specify no intercept (

`~ 0 + ...`

), then R will try to generate a cell means model, where you compute the mean of each group. But if the resulting design matrix won't be full rank, it will automatically drop coefficients. That's one point, but is primarily of interest for ANOVA models where all your coefficients are factors.Normally, if you have a continuous variable in your model (you are fitting an ANCOVA model), you want an intercept (otherwise you are forcing the regression line for Age to go through zero, which is probably not what you want, because there is no reason to think that any gene expression is zero at birth). And if you include an intercept, given that the default for R is a treatments contrast parameterization, that means you will be setting one group to be the baseline. If you are fitting an ANOVA, the difference between having an intercept or not is only relevant for coefficient interpretation. But if you are fitting an ANCOVA model, then you should include an intercept.

If this is confusing, you need to do some reading about linear regression in general, and how R does it in particular. Or if that isn't of particular interest, you should find someone local who understands linear regression to help you. Just doing stuff without understanding what you are doing is a recipe for disaster.

Anyway, as an example,

In the second parameterization, the intercept computes the mean of the female control subjects (which we know because all other coefficients in rows 9-10 are zero, except for age), and the SexM coefficient estimates the difference between control males and control females, and the DiseaseDisease estimates the difference between disease females and control females. But since you have already 'adjusted out' the difference between males and females, you use all the data to estimate disease vs control.

Thank you James for the detailed follow up! I great appreciate it. The first model.matrix is similar to mine with 4 columns. Try your suggestion now!