Question

removeBatchEffect: Coefficients not estimable when using "batch matrix" for `covariates` argument.

nhaus ▴ 30

Hello,

I am trying to remove three known batch effects from my data so that I can use it for clustering. In total I have around 150 patients. Two of the batch effects are categorical variables (Gender: male and female; Technician: A, B and C) and one is continuous (Age).

Usually, when using limma::removeBatchEffect, I create a model matrix that includes all my batch effects and then pass it to the covariates argument, like this:

batch.matrix <- model.matrix(~0+Gender+Age+Technician, data=metadata) 
corrected_counts <- removeBatchEffect(raw_counts, covariates=batch.matrix)

However, this results in the following warning: Coefficients not estimable: GenderMale

If I rearrange the batch.matrix like this batch.matrix <- model.matrix(~0+Technician+Age+Gender, data=metadata) the warning changes to:

Coefficients not estimable: TechnicianB

I have seen the same warning before when the variables I was trying to correct for were confounded, but that is not the case this time.

Interestingly, if I try to remove the batch effect like this:

corrected_counts <- removeBatchEffect(raw_counts, batch = metadata$Gender, 
                                      batch2 = metadata$Technician, 
                                      covariates = metadata$Age)

I do not receive a warning and everything works as expected.

I would really appreciate any insight into why this behavior occurs or what the reason for it could be!

Thanks!

limma BatchEffect • 1.3k views

updated 23 months ago by Gordon Smyth 50k • written 23 months ago by nhaus ▴ 30

score 2 · Accepted Answer · 2022-06-01

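That's because batch.matrix is confounded with the intercept term. removeBatchEffect fits an intercept by default, and with 0+ the first factor in the formula gets an indicator column for every one of its levels, so those columns sum to the intercept column. If you want to load all the batch variables into the covariate matrix, then you need: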

batch.matrix <- model.matrix(~Gender+Age+Technician, data=metadata)[, -1]

It is never correct to use 0+ in this context.
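
To see the confounding directly, here is a minimal base-R sketch. The metadata below is made-up toy data (not the poster's), and the explicit column of ones stands in for the intercept that removeBatchEffect adds through its default design argument. It only compares the number of columns with the matrix rank; it does not call limma.

```r
# Toy metadata: hypothetical values for illustration only
set.seed(1)
metadata <- data.frame(
  Gender     = factor(rep(c("Female", "Male"), times = 6)),
  Age        = round(runif(12, 20, 70)),
  Technician = factor(rep(c("A", "B", "C"), each = 4))
)

# Stand-in for the intercept column that removeBatchEffect fits by default
intercept <- matrix(1, nrow(metadata), 1, dimnames = list(NULL, "Intercept"))

# With 0+, the first factor (Gender) gets one indicator column per level
with_zero <- model.matrix(~0 + Gender + Age + Technician, data = metadata)

# Fitting with an intercept and then dropping that column leaves
# only treatment-contrast columns for each factor
without_intercept <- model.matrix(~Gender + Age + Technician, data = metadata)[, -1]

full_bad  <- cbind(intercept, with_zero)
full_good <- cbind(intercept, without_intercept)

# A rank below the number of columns means some coefficient is not estimable
rank_bad  <- qr(full_bad)$rank
rank_good <- qr(full_good)$rank

# The two Gender indicator columns add up to the intercept column
gender_sum_is_intercept <-
  all(with_zero[, "GenderFemale"] + with_zero[, "GenderMale"] == 1)

print(c(columns = ncol(full_bad),  rank = rank_bad))
print(c(columns = ncol(full_good), rank = rank_good))
```

In the 0+ version the combined matrix has one more column than its rank, which is the situation the "Coefficients not estimable" warning reports. In the [, -1] version the two numbers agree.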

Although the above approach will work, the second approach where you distinguish categorical batches from continuous covariates is better. Both approaches should give the same clustering, but the second approach changes the original expression values less.