removeBatchEffect: Coefficients not estimable when using "batch matrix" for `covariates` argument.
nhaus • 0


I am trying to remove three known batch effects from my data so that I can use it for clustering. In total I have around 150 patients. Two of the batch effects are categorical variables (Gender (male and female) and Technician (A, B and C)) and one is continuous (Age).

Usually, when using limma::removeBatchEffect, I create a model matrix that includes all my batch effects and then pass it to the covariates argument like this:

batch.matrix <- model.matrix(~0+Gender+Age+Technician, data=metadata) 
corrected_counts <- removeBatchEffect(raw_counts, covariates=batch.matrix)

However, this results in the following warning: Coefficients not estimable: GenderMale

If I rearrange the batch.matrix like this: batch.matrix <- model.matrix(~0+Technician+Age+Gender, data=metadata), the warning changes to this:

Coefficients not estimable: TechnicianB

I have seen the same warning before when the variables I tried to correct for were confounded, but that is not the case this time.

Interestingly, if I try to remove the batch effect like this:

corrected_counts <- removeBatchEffect(raw_counts, batch = metadata$Gender, 
                                      batch2 = metadata$Technician, 
                                      covariates = metadata$Age)

I do not receive a warning and everything works as expected.

I would really appreciate any insight into why this behavior occurs or what the reason for it could be!


limma BatchEffect • 243 views
WEHI, Melbourne, Australia

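That's because batch.matrix is confounded with the intercept term: with 0+, the first factor gets a column for every one of its levels, and those columns sum to the intercept column that removeBatchEffect fits by default. If you want to load all the batch variables into the covariate matrix, then you need: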

batch.matrix <- model.matrix(~Gender+Age+Technician, data=metadata)[, -1]

It is never correct to use 0+ in this context.
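
The confounding with the intercept can be checked in base R without limma. The sketch below uses a small made-up `metadata` data frame (hypothetical values, not the poster's data) and a helper `rank_of` (also hypothetical) that prepends an intercept column, on the assumption that removeBatchEffect fits its default intercept-only design alongside the covariates:

```r
# Toy stand-in for the poster's `metadata`; all values are invented
set.seed(1)
metadata <- data.frame(
  Gender     = factor(rep(c("Female", "Male"), times = 6)),
  Technician = factor(rep(c("A", "B", "C"), each = 4)),
  Age        = round(runif(12, 20, 70))
)

# With 0+, Gender gets one column per level, and those columns sum to 1
with_zero <- model.matrix(~0 + Gender + Age + Technician, data = metadata)

# Intercept-based coding with the intercept column dropped afterwards
no_intercept <- model.matrix(~Gender + Age + Technician, data = metadata)[, -1]

# Rank of the covariates once an intercept column is added
# (assumed here to mirror the default design in removeBatchEffect)
rank_of <- function(x) qr(cbind(Intercept = 1, x))$rank

rank_of(with_zero)     # 5, although there are 6 columns: one is redundant
rank_of(no_intercept)  # 5, matching its 5 columns: full rank
```

One redundant column in the first case corresponds to the single "not estimable" coefficient in the warning.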

Although the above approach will work, the second approach where you distinguish categorical batches from continuous covariates is better. Both approaches should give the same clustering, but the second approach changes the original expression values less.
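
As far as I understand it, the difference comes from how the factor is coded: factors passed via batch and batch2 are given sum-to-zero contrasts, whereas model.matrix uses treatment (reference-level) coding by default. A minimal base R sketch with a made-up, balanced `tech` factor (hypothetical, not the poster's data) shows the two codings side by side:

```r
# Hypothetical balanced factor: 4 samples per technician
tech <- factor(rep(c("A", "B", "C"), each = 4))

# Treatment coding: effects are measured relative to reference level A
treatment <- model.matrix(~tech)[, -1, drop = FALSE]

# Sum-to-zero coding: effects are measured relative to the mean over levels
contrasts(tech) <- contr.sum(levels(tech))
sum_to_zero <- model.matrix(~tech)[, -1, drop = FALSE]

colSums(treatment)    # 4 and 4: columns are not centred
colSums(sum_to_zero)  # 0 and 0: columns are centred in a balanced design
```

With treatment coding the subtracted batch term shifts every non-reference sample towards level A, while with sum-to-zero coding the batches are shifted towards their common average, which is consistent with the corrected values staying closer to the originals.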

