What is the correct use of removeBatchEffect's design parameter
2
0
Entering edit mode
Sam ▴ 10
@sam-21502
Last seen 1 day ago
Jerusalem

I am analysing an experiment done on humans using DESEQ2; I want to control for age and BMI, and to visualise the results. In the DESeq2 vignette it says

For unbalanced batches (e.g. the condition groups are not distributed balanced across batches), the design argument should be used, see ?removeBatchEffect in the limma package for details.

As far as I understand, I can check whether the design is balanced by looking at table(Condition, batch)

                 BMI_group
Condition  Normal Obese Overweight
Negative  8     1          1
Positive  2     1          2

Age_group
Condition  Age1 Age2 Age3 Age4
Negative    1    1    3    5
Positive    1    1    0    3


Since the numbers are not equal for each group, the design is unbalanced.

So, I have added a design argument to removeBatchEffect :

dds <- DESeqDataSetFromMatrix(countData, colData, design = ~ 1)
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
rlvals <- as.data.frame(assay(rlog(dds))) # rlog = regularized log

design0 <- model.matrix( ~Condition, data=colData)

rlvals_batch <- removeBatchEffect(rlvals, batch=colData$Age_group, batch2=colData$BMI_group, design=design0)


However, in the resulting PCA plots, the separation according to condition was overwhelming.

Before cleaning the batch effect

After cleaning the batch effect, with the design parameter

Distance heatmap (cleaning batch, with design parameter) :

( When cleaning the batch effect without the design parameter, the PCA has not changed much. )

So, when supplying the design parameter, too much of the variance was cleaned. When is it correct to use the design parameter (in the context of DESeq2), and when it is not?

deseq2 limma • 856 views
2
Entering edit mode
@gordon-smyth
Last seen 3 minutes ago
WEHI, Melbourne, Australia

Repeating Michael's answer, it is always correct to include the design matrix.

The batch-corrected PCA plot tends to give an overstated impression of amount of DE because it gives the appearance that the batch-correct values are independent observations whereas the batch correction has introduced correlations. This is unavoidable however because the PCA plot can't display those correlations.

If you look at your PCA plot closely, you will that PCA2, representing between group variation, explains only 19% of the variation whereas PCA1, representing within-group variation, is much larger. So it's not surprising there are only a few DE genes.

0
Entering edit mode

The batch-corrected PCA plot tends to give an overstated impression of amount of DE because it gives the appearance that the batch-correct values are independent observations whereas the batch correction has introduced correlations. This is unavoidable however because the PCA plot can't display those correlations.

Thanks. I did not understand the answer; could you refer me to read something that would explain the issue?

0
Entering edit mode

No sorry, there isn't a reference on this. But it relates to the fact that you cannot perform batch correction and then input the corrected values to limma or another DE program as if the batch effect never existed. I have posted on that topic several times. There is no algorithm that can remake the data entirely as if the batch effects never existed.

0
Entering edit mode

The batch-corrected PCA plot tends to give an overstated impression of amount of DE because it gives the appearance that the batch-correct values are independent observations whereas the batch correction has introduced correlations. This is unavoidable however because the PCA plot can't display those correlations.

Could you explain what correlation the batch correction introduces, which modelling the batch effect does not introduce (between genes, or between samples) ?

1
Entering edit mode

Batch correction and modelling the batch effect are the same thing, they both introduce the same correlations. The difference is that DE analysis uses the correlations in context whereas plotting the batch corrected values ignores the correlations.

The clear separation between your batch corrected points is not an error and is not caused by the correlations I refered to. It is a real feature of your data. The only effect of the correlations is that the separation is slightly less significant than it appears to be by eye, because the number of independent observations is not as large as it appears to be.

The software has worked correctly and has done what it is intended to do.

0
Entering edit mode

The batch-corrected PCA plot tends to give an overstated impression of amount of DE because it gives the appearance that the batch-correct values are independent observations whereas the batch correction has introduced correlations.

Is this is a general statement - not dependent on the specific method used to correct for batch (removeBatchEffect/ComBat/Combat-Seq), or is it true for removeBatchEffect specifically?

1
Entering edit mode

Yes, it is a general statement. There is no algorithm that can remake the data entirely as if the batch effects never existed.

0
Entering edit mode

it is always correct to include the design matrix.

By including the design matrix, we are correcting the variance of the batch effect, but without correcting for the terms included in the design matrix, right?

It seems to me intuitively that in doing this we already assume that the terms included in the design matrix have an effect on the data. So in some sense we fit the data to behave according to the design matrix. Since the terms that appear in the design matrix are those which we want to test using DE, this seems like overfitting.

Do you have the time to explain why my intuition is incorrect?

1
Entering edit mode

I will add one last comment.

In an unbalanced design with covariates, the effects can be decomposed into three parts:

1. Effects unambiguously due to the treatment condition and not the covariate;
2. Effects that could be due to either the condition or the covariates;
3. Effects unambiguously due to the covariates and not the treatment condition.

When you do a DE analysis in limma, edgeR or DESeq2 with covariates in the model then you are testing only for 1. Significance testing is conservative in this sense that it discounts the ambiguous effects in category 2. If the covariates and treatment groups are strongly associated with one another then it becomes impossible to detect any treatment effect because all the effects will become part of the ambiguous category 2.

On the other hand, when you run removeBatchEffect with design and covariates then you remove only 3, so both 1 and 2 remain in the plot. The batch-corrected plot is preserving all the effect that might be due to the treatment, including the ambiguous part, hence the batch-corrected plot will in general show stronger treatment effects than will be detected by the significance analysis.

If you want to view a graph that accurately displays how adjusting for covariates works in the context of the DE analysis, then you need something called an added-variable-plot. But added variable plots would need to be constructed for each gene individually and tend to be too subtle for non-experts to use and interpret.

If you read my answer on Cross-Validated explaining added-variable-plots, have a look the second plot (R vs X) that I show in which the true relationship between Y and X has been destroyed. Running removeBatchEffect without the design matrix would be equivalent to that.

0
Entering edit mode

There is no assumption that the design matrix necessarily affects the data.

Batching correcting without the design matrix may produce misleading results because it may remove genuine non-batch effects in the data, if they exist. I do not see any purpose in doing that.

Frankly, it appears that your data contains lots of uncontrolled unbalanced covariates. In such an experiment, there may be ambiguity about which covariate or factor is causal. We cannot offer you magic solutions that will make the ambiguity go away.

0
Entering edit mode

To avoid confusion, I've changed this piece of code in the DESeq2 vignette so it just always uses the design argument.

2
Entering edit mode
@mikelove
Last seen 6 minutes ago
United States

You can actually always use the design parameter, but especially it is important for unbalanced designs.

Why do you say “too much” of the variance due to the nuisance variables was removed? I’m confused why you think it is too much.

0
Entering edit mode

First, a technical reason : when performing differential expression with DESeq2, and supplying this two variables as batch

dds <- DESeqDataSetFromMatrix(countData, colData, design = ~ Age_group + BMI_group + Condition)
...
dds <- DESeq(dds)


only 5 genes turned out to be significant (compared to zero genes, when batch correction was not used).

Judging by the PCA, I thought that the amount should have been much bigger as the difference is really pronounced on the second PC.

Also, frankly, the PCA seems just too good to be true. This is noisy human data, and I did not expect such obvious differences.

I have added a distance heatmap plot to the question (if that helps).

2
Entering edit mode

I think the two signals you are seeing (the PCA separation and the DE gene list) are not totally incompatible. It's possible to have a number of genes with some signal, though not surviving FDR correction, which when aggregating across genes provide separation in the top 500 genes by variance.

0
Entering edit mode

Could you suggest how can one verify that it'is indeed the case?

0
Entering edit mode

In terms of verifying that you don't have power to survive FDR correction, you have the evidence which is the DESeq2 results.

And in terms of there nevertheless being aggregate signal in the top 500 genes, you have the evidence which is the PCA plot.

0
Entering edit mode

Hi, What do you mean by aggregation across genes? I have re-read your article on PCA and did not see a mention of aggregation (in PCA) there.

( I would like to understand this point and to verify that this is indeed the case in my data -and not a mistake in my code)

0
Entering edit mode

Is there a way to get a feeling how modeling the batch effect in DESeq2 works on my data, other than ComBat or removeBatchCorrection?

For example, after adjusting for two batch effects, a specific gene got a log fold change of -30. I would like to get a feeling why & how it happened. It isn't clear to me from simulating it via removeBatchEffect, as some of its expression values become negative, and I don't know how to think of those.