Hello everyone,
Could I please get an opinion on whether a batch correction is recommended in my data?
Background: I have performed RNA-seq gene expression analysis on 2 condition groups. The data was normalised using the rlog function (DESeq2) (named "Unbatched" in the PCA plot). Subsequently, the normalised data was batch corrected using the removeBatchEffect function (from limma, which edgeR depends on) (named "Batchcorrected").
Problem: I have 4 different batches, and the condition groups are not evenly distributed across them (particularly in batches 3 and 4, shown below). I have read that this may cause incorrect downstream analysis.
       condition
batch  Group1  Group2
    1       4       2
    2       4       2
    3       2       0
    4       0       3
I've included 2 PCA plots: 1) a PCA plot using all genes in the data (17,966 genes) and 2) a PCA plot using the 500 genes with the highest variance in the data.
In the PCA plot made using all genes, a batch effect can be seen, but it is much less apparent when using the top 500 genes. Given that there is an uneven distribution of sample groups across the batches, would you recommend adjusting for batch effects when performing a differential expression analysis in edgeR?
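To make the layout concrete, here is a minimal base-R sketch that rebuilds the sample layout from the table above and checks whether an additive design with batch as a blocking factor is estimable. The object names (batch, condition, design) are my own choices for illustration, and the sample order is an assumption; only the counts per batch and group come from the table.

```r
# Sample layout copied from the batch-by-condition table in the post.
# The ordering of samples is hypothetical; only the counts are from the table.
batch <- factor(rep(c("1", "2", "3", "4"), times = c(6, 6, 2, 3)))
condition <- factor(c(
  rep(c("Group1", "Group2"), times = c(4, 2)),  # batch 1
  rep(c("Group1", "Group2"), times = c(4, 2)),  # batch 2
  rep("Group1", 2),                             # batch 3
  rep("Group2", 3)                              # batch 4
))

# The cross-tabulation should reproduce the table in the post.
counts <- table(batch, condition)
print(counts)

# Additive design: batch is a blocking factor, condition is the effect of interest.
design <- model.matrix(~ batch + condition)

# Full column rank means the condition effect is estimable despite the imbalance.
design_rank <- qr(design)$rank
full_rank <- design_rank == ncol(design)

# Only batches that contain both groups can inform the condition effect.
informative <- rownames(counts)[rowSums(counts > 0) == 2]
```

With this layout the design is full rank, but only batches 1 and 2 contain both groups, so in an additive model the within-batch comparison of conditions comes from those two batches. A design matrix like this could presumably be passed to the usual edgeR functions (for example estimateDisp and glmQLFit) on the raw counts, rather than running the differential expression analysis on batch-corrected values.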
Thank you :)
Hi Aaron :)
Thank you so much for your response and help. Could I please ask another question? I have noticed that, in my top differentially expressed genes (long non-coding RNA genes), the batch effect adjustment using the removeBatchEffect function (limma) substantially alters the expression values for some samples. This was more pronounced for samples in batch 4, where the sample distribution is uneven.
For example, in this boxplot, there are 3 samples in group 2 where the expression increases substantially when removeBatchEffect was used, but not with ComBat. This was the case for most of the top differentially expressed genes found by including the batch factor in the design matrix with edgeR. Could you please shed some light on this matter? Thank you!
This is the code that was used:
Well, for starters, you're using condition1 for ComBat and condition2 for removeBatchEffect.
Hi Aaron :)
Sorry about that confusion. Condition1 and condition2 are the same. I was experimenting with different groupings before and forgot to fix that.
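Since the condition factor matters here, a minimal base-R sketch of what a linear-model batch removal does may help. It is written by hand for illustration and is not limma's source code; the expression values are made up (noise-free, one gene), and remove_batch is a hypothetical helper. In limma itself, the condition would be protected by passing something like design = model.matrix(~ condition) to removeBatchEffect.

```r
# Same sample layout as the table in the original post (sample order assumed).
batch <- factor(rep(c("1", "2", "3", "4"), times = c(6, 6, 2, 3)))
condition <- factor(c(rep(c("Group1", "Group2"), c(4, 2)),
                      rep(c("Group1", "Group2"), c(4, 2)),
                      rep("Group1", 2), rep("Group2", 3)))

# Hypothetical noise-free log-expression for one gene:
# baseline 5, Group2 effect +2, batch shifts 0, 1, 2, 3.
batch_shift <- c(0, 1, 2, 3)
y <- 5 + 2 * (condition == "Group2") + batch_shift[as.integer(batch)]

# Batch columns with sum-to-zero coding, so the correction keeps the overall level.
contrasts(batch) <- contr.sum(nlevels(batch))
X_batch <- model.matrix(~ batch)[, -1, drop = FALSE]

# Hypothetical helper: fit design and batch terms jointly,
# then subtract only the fitted batch component.
remove_batch <- function(y, design) {
  fit <- lm.fit(cbind(design, X_batch), y)
  beta <- fit$coefficients[-seq_len(ncol(design))]
  drop(y - X_batch %*% beta)
}

# Condition protected via the design matrix.
protected <- remove_batch(y, model.matrix(~ condition))

# No condition information: intercept-only design.
naive <- remove_batch(y, matrix(1, length(y), 1))
```

In this toy example, the protected correction leaves a constant value within each group and keeps the Group2 effect of 2. The intercept-only version gives the Group1 samples in batch 3 and the Group2 samples in batch 4 the same corrected value, because in single-group batches the group difference is indistinguishable from the batch effect. That is one reason the corrected values for batch 4 can depend heavily on which condition factor was supplied.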
Also, I noticed that the above-mentioned expression changes from removeBatchEffect are occurring in genes that are lowly expressed.
Assuming that condition1 and condition2 are truly identical (i.e., both factors), all I can say is that ComBat performs some moderation on the batch effects, with the aim of stabilizing the batch effect estimates by sharing information across genes. If the moderation is strong, the observed batch effect for some genes will not be fully removed. (Whether the true batch effect is removed or not is another question.) Unfortunately, it's impossible to infer anything from your plots; I can't make out the labels or the batch identity for each point.
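To illustrate what moderation means here, below is a small R sketch of generic precision-weighted shrinkage, where a gene's own batch-effect estimate is pulled toward a value shared across genes. This is only the general idea under a simple normal-prior assumption, not ComBat's actual implementation; the function shrink_batch_effect and all numbers are hypothetical.

```r
# Illustrative precision-weighted shrinkage of a per-gene batch-effect estimate.
# All names and numbers are hypothetical; this is not ComBat's implementation.
shrink_batch_effect <- function(gamma_hat, prior_mean, n, tau2, sigma2) {
  # Weighted average of the gene's own estimate and the across-gene prior mean.
  # n: samples in the batch, tau2: prior variance, sigma2: gene-level variance.
  (n * tau2 * gamma_hat + sigma2 * prior_mean) / (n * tau2 + sigma2)
}

gamma_hat <- 3.0   # observed batch shift for one gene
prior_mean <- 0.5  # typical batch shift across genes

# Small batch, noisy gene: strong pull toward the prior mean.
small_noisy <- shrink_batch_effect(gamma_hat, prior_mean,
                                   n = 3, tau2 = 1, sigma2 = 3)

# Larger batch, quieter gene: the estimate stays close to the observed value.
large_quiet <- shrink_batch_effect(gamma_hat, prior_mean,
                                   n = 6, tau2 = 1, sigma2 = 0.5)

# Portion of the observed shift that would remain in the data after correction.
residual_small <- gamma_hat - small_noisy
residual_large <- gamma_hat - large_quiet
```

In this sketch, more of the observed shift is left in the data for the small, noisy case than for the large, quiet one. An unmoderated linear-model correction would subtract the full observed shift in both cases, which is consistent with the two methods disagreeing most for variable genes in a small batch.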