I am analysing an experiment done on humans using DESEQ2; I want to control for age and BMI, and to visualise the results. In the DESeq2 vignette it says

For unbalanced batches (e.g. the condition groups are not distributed balanced across batches), the design argument should be used, see ?removeBatchEffect in the limma package for details.

As far as I understand, I can check whether the design is balanced by looking at table(Condition, batch)

```
BMI_group
Condition Normal Obese Overweight
Negative 8 1 1
Positive 2 1 2
Age_group
Condition Age1 Age2 Age3 Age4
Negative 1 1 3 5
Positive 1 1 0 3
```

Since the numbers are not equal for each group, the design is unbalanced.

So, I have added a design argument to removeBatchEffect :

```
dds <- DESeqDataSetFromMatrix(countData, colData, design = ~ 1)
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
rlvals <- as.data.frame(assay(rlog(dds))) # rlog = regularized log
design0 <- model.matrix( ~Condition, data=colData)
rlvals_batch <- removeBatchEffect(rlvals, batch=colData$Age_group, batch2=colData$BMI_group, design=design0)
```

However, in the resulting PCA plots, the separation according to condition was overwhelming.

Before cleaning the batch effect

After cleaning the batch effect, with the design parameter

Distance heatmap (cleaning batch, with design parameter) :

( When cleaning the batch effect **without** the design parameter, the PCA has not changed much. )

So, when supplying the design parameter, too much of the variance was cleaned. When is it correct to use the design parameter (in the context of DESeq2), and when it is not?

Thanks. I did not understand the answer; could you refer me to read something that would explain the issue?

No sorry, there isn't a reference on this. But it relates to the fact that you cannot perform batch correction and then input the corrected values to limma or another DE program as if the batch effect never existed. I have posted on that topic several times. There is no algorithm that can remake the data entirely as if the batch effects never existed.

Could you explain what correlation the batch correction introduces, which modelling the batch effect does not introduce (between genes, or between samples) ?

Batch correction and modelling the batch effect are the same thing, they both introduce the same correlations. The difference is that DE analysis uses the correlations in context whereas plotting the batch corrected values ignores the correlations.

The clear separation between your batch corrected points is not an error and is not caused by the correlations I refered to. It is a real feature of your data. The only effect of the correlations is that the separation is slightly less significant than it appears to be by eye, because the number of independent observations is not as large as it appears to be.

The software has worked correctly and has done what it is intended to do.

Is this is a general statement - not dependent on the specific method used to correct for batch (removeBatchEffect/ComBat/Combat-Seq), or is it true for removeBatchEffect specifically?

Yes, it is a general statement.

There is no algorithm that can remake the data entirely as if the batch effects never existed.By including the design matrix, we are correcting the variance of the batch effect, but without correcting for the terms included in the design matrix, right?

It seems to me intuitively that in doing this we already assume that the terms included in the design matrix have an effect on the data. So in some sense we fit the data to behave according to the design matrix. Since the terms that appear in the design matrix are those which we want to test using DE, this seems like overfitting.

Do you have the time to explain why my intuition is incorrect?

I will add one last comment.

In an unbalanced design with covariates, the effects can be decomposed into three parts:

When you do a DE analysis in limma, edgeR or DESeq2 with covariates in the model then you are testing only for 1. Significance testing is conservative in this sense that it discounts the ambiguous effects in category 2. If the covariates and treatment groups are strongly associated with one another then it becomes impossible to detect any treatment effect because all the effects will become part of the ambiguous category 2.

On the other hand, when you run removeBatchEffect with design and covariates then you remove only 3, so both 1 and 2 remain in the plot. The batch-corrected plot is preserving all the effect that

mightbe due to the treatment, including the ambiguous part, hence the batch-corrected plot will in general show stronger treatment effects than will be detected by the significance analysis.If you want to view a graph that accurately displays how adjusting for covariates works in the context of the DE analysis, then you need something called an added-variable-plot. But added variable plots would need to be constructed for each gene individually and tend to be too subtle for non-experts to use and interpret.

If you read my answer on Cross-Validated explaining added-variable-plots, have a look the second plot (R vs X) that I show in which the true relationship between Y and X has been destroyed. Running removeBatchEffect without the design matrix would be equivalent to that.

There is no assumption that the design matrix necessarily affects the data.

Batching correcting without the design matrix may produce misleading results because it may remove genuine non-batch effects in the data, if they exist. I do not see any purpose in doing that.

Frankly, it appears that your data contains lots of uncontrolled unbalanced covariates. In such an experiment, there may be ambiguity about which covariate or factor is causal. We cannot offer you magic solutions that will make the ambiguity go away.

To avoid confusion, I've changed this piece of code in the DESeq2 vignette so it just always uses the design argument.