I am analysing a human experiment with DESeq2. I want to control for age and BMI, and to visualise the results. The DESeq2 vignette says:
For unbalanced batches (e.g. the condition groups are not distributed balanced across batches), the design argument should be used, see ?removeBatchEffect in the limma package for details.
As far as I understand, I can check whether the design is balanced by looking at table(Condition, batch):

              BMI_group
    Condition  Normal Obese Overweight
      Negative      8     1          1
      Positive      2     1          2

              Age_group
    Condition  Age1 Age2 Age3 Age4
      Negative    1    1    3    5
      Positive    1    1    0    3
Since the numbers are not equal for each group, the design is unbalanced.
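For reference, here is a minimal sketch of that balance check in base R. It rebuilds a sample sheet from the Condition by BMI_group counts shown above; `colData_demo` and `is_balanced` are hypothetical names, and "balanced" is taken here to mean that every condition has the same batch composition, which is only one possible working definition.

```r
# Counts copied from the Condition x BMI_group table in the post
bmi_counts <- matrix(
  c(8, 1, 1,
    2, 1, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(Condition = c("Negative", "Positive"),
                  BMI_group = c("Normal", "Obese", "Overweight"))
)

# Expand the counts into one row per sample (hypothetical sample sheet;
# the real colData is not shown in the post)
cells <- as.data.frame(as.table(bmi_counts))
colData_demo <- cells[rep(seq_len(nrow(cells)), cells$Freq),
                      c("Condition", "BMI_group")]

# Cross-tabulate condition against batch, as in table(Condition, batch)
tab <- table(colData_demo$Condition, colData_demo$BMI_group)
print(tab)

# Row proportions: assumed definition of balance is that every condition
# has the same share of samples in each batch
props <- prop.table(tab, margin = 1)
is_balanced <- isTRUE(all.equal(props[1, ], props[2, ]))
print(is_balanced)
```

The same check can be repeated for Age_group by swapping in the second table's counts.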
So I added a design argument to removeBatchEffect:

    library(DESeq2)
    library(limma)   # provides removeBatchEffect()

    # Intercept-only design: the transformation ignores Condition
    dds <- DESeqDataSetFromMatrix(countData, colData, design = ~ 1)
    dds <- estimateSizeFactors(dds)
    dds <- estimateDispersions(dds)
    rlvals <- as.data.frame(assay(rlog(dds)))  # rlog = regularized log

    # Protect the Condition effect while removing age and BMI effects
    design0 <- model.matrix(~ Condition, data = colData)
    rlvals_batch <- removeBatchEffect(rlvals,
                                      batch = colData$Age_group,
                                      batch2 = colData$BMI_group,
                                      design = design0)
However, in the resulting PCA plots, the separation according to condition was overwhelming.
[Figure: PCA before cleaning the batch effect]
[Figure: PCA after cleaning the batch effect, with the design parameter]
[Figure: Distance heatmap after cleaning the batch effect, with the design parameter]
(When I cleaned the batch effect without the design parameter, the PCA did not change much.)
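To make the comparison between the two calls reproducible, here is a rough sketch on simulated data. Everything in it is an assumption for illustration: `rlvals_demo` is random noise standing in for the rlog matrix, `colData_demo` matches the marginal counts in the tables above but the pairing of age and BMI groups per sample is invented, and `pc1_share` is a hypothetical helper.

```r
library(limma)  # provides removeBatchEffect()

set.seed(1)

# Hypothetical sample sheet: marginal counts match the tables in the post,
# but which age group goes with which BMI group is invented
colData_demo <- data.frame(
  Condition = factor(rep(c("Negative", "Positive"), times = c(10, 5))),
  Age_group = factor(c("Age1", "Age2", rep("Age3", 3), rep("Age4", 5),
                       "Age1", "Age2", rep("Age4", 3))),
  BMI_group = factor(c(rep("Normal", 8), "Obese", "Overweight",
                       rep("Normal", 2), "Obese", rep("Overweight", 2)))
)

# Simulated log-scale values (pure noise) standing in for the rlog matrix
rlvals_demo <- matrix(rnorm(200 * 15), nrow = 200,
                      dimnames = list(paste0("gene", 1:200),
                                      paste0("s", 1:15)))

design0 <- model.matrix(~ Condition, data = colData_demo)

# Same call as in the post: Condition is protected via design
with_design <- removeBatchEffect(rlvals_demo,
                                 batch = colData_demo$Age_group,
                                 batch2 = colData_demo$BMI_group,
                                 design = design0)

# Same call without design: only an intercept is protected
without_design <- removeBatchEffect(rlvals_demo,
                                    batch = colData_demo$Age_group,
                                    batch2 = colData_demo$BMI_group)

# Fraction of total variance carried by the first principal component
pc1_share <- function(x) {
  p <- prcomp(t(x))
  p$sdev[1]^2 / sum(p$sdev^2)
}

print(sapply(list(raw = rlvals_demo,
                  with_design = with_design,
                  without_design = without_design),
             pc1_share))
```

Because the simulated matrix contains no real condition or batch signal, any difference between the two corrected versions here comes from the call itself rather than from biology.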
So, when I supply the design parameter, it looks as if too much of the variance is removed. When is it correct to use the design parameter (in the context of DESeq2), and when is it not?