Batch Effect Correction
Entering edit mode
Antonio Ahn ▴ 10
Last seen 8 months ago
University of Otago

Hello everyone, 

Could I please get an opinion on whether a batch correction is recommended in my data? 

Background: I have performed RNAseq gene expression analysis on 2 condition groups. The data was normalised using the rlog function (DEseq2)  (named "Unbatched" in the PCA plot). Subsequently, the normalised data was batch corrected using the removeBatcheffect function (edgeR) (named "Batchcorrected")

Problem: The problem is that I have 4 different batches where the condition groups are not evenly distributed across the batches (particularly in batch 3 and 4, shown below). I have read that this may cause incorrect downstream analysis. 

batch   Group1 Group2
    1           4            2
    2           4            2
             2            0
    4           0            3

I've included 2 PCA plots. This includes 1) a PCA plot using all genes in the data (17,966 genes) and 2) a PCA plot using 500 genes with the highest variance in the data. 


When looking at the PCA plot made using all genes, a batch effect can be seen however not much when using the top 500 genes. Given that there is an uneven distribution of sample groups across the batches, would you recommend adjusting for batch effects when performing a differential expression analysis in edgeR? 

Thank you :) 



batch effect removebatcheffect edgeR rld DEseq2 • 2.4k views
Entering edit mode
Aaron Lun ★ 27k
Last seen 11 hours ago
The city by the bay

Include the batch factor when creating a design matrix to run in edgeR. Not doing so would be incorrect; there is clearly a batch effect in all of your PCA plots. I don't know why you think that the batch effect is "not much" in the PCA plot generated with the top 500 highly variable genes, the batch effect looks pretty obvious to me.

P.S. You should explain the point aesthetics in your PCA plot. I'm guessing that shapes correspond to batches and colour represents your experimental condition, but good questions should not require guesswork.

P.P.S. Showing code is always better than saying what you (think you) did. See the posting guide for details.

Entering edit mode

Hi Aaron :)

Thank you so much for your response and help. Could I please ask another question? I have noticed that in my top differentially expressed genes (long non-coding RNA genes), that the batch effect adjustment using the removeBatcheffect function (edgeR) significantly alters the expression values for some samples. This was more pronounced for samples in batch 4 where there is an uneven sample distribution.

For example, in this boxplot, there are 3 samples in group 2 where the expression increases significantly when removeBatcheffect was used however not with ComBat. This was the case for most of the top differentially expressed genes found by including the batch factor in the design matrix with edgeR. 

Could you please shed some light on this matter? Thank you! 


These are the codes that were used:

# s2c is my sample to conditions table
# Generating the batch effect adjusted matrix using removeBatchEffect
status <- s2c$condition2 %>% factor
design <- model.matrix(~0 + status)
colnames(design) <- gsub("status", "", colnames(design))
TPMrbe <- removeBatchEffect(TPMmatrix , batch = s2c$batch, design = design)

# Generating the batch effect adjusted matrix using ComBat
batch = factor(s2c$batch)
modcombat = model.matrix(~1, data = s2c)
modPDL1 = model.matrix(~condition1, data = s2c)
TPMcombat <- ComBat(TPMmatrix, batch = batch , mod = modcombat, par.prior=TRUE, prior.plots=FALSE)

# Performing differential expression analysis using edgeR.
status <- factor(s2c$condition2)
batch <- factor(s2c$batch)
design <- model.matrix(~0 + status + batch)
colnames(design) <- gsub("status", "", colnames(design))
xE <- DGEList(counts = countsMatrix, group = status)

xE <- calcNormFactors(xE, method = "TMM")
xE <- estimateDisp(xE, design, robust = TRUE) 
fit <- glmQLFit(xE, design, robust = TRUE)
contr.matrix <- makeContrasts( group = group2 - group1, levels = design)
resE <- glmTreat(fit, contrast = contr.matrix, lfc=1)
ttE <- topTags(resE , n=Inf, p.value=0.05, adjust.method="BH")
Entering edit mode

Well, for starters, you're using condition1 for ComBat and condition2 for removeBatchEffect.

Entering edit mode

Hi Aaron :)

Sorry about that confusion. Condition1 and condition2 are the same. I was experimenting with different groupings before and forgot to fix that. 

Also, I noticed that the above-mentioned expression changes from removeBatcheffect are occurring in genes that are lowly expressed. 



Entering edit mode

Assuming that condition1 and condition2 are truly identical (i.e., both factors), all I can say is that ComBat performs some moderation on the batch effects, with the aim of stabilizing the batch effect estimates by sharing information across genes. If the moderation is strong, the observed batch effect for some genes will not be fully removed. (Whether the true batch effect is removed or not is another question.) Unfortunately, it's impossible to infer anything from your plots; I can't make out the labels or the batch identity for each point.


Login before adding your answer.

Traffic: 526 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6