Could I please get an opinion on whether batch correction is recommended for my data?
Background: I have performed RNA-seq gene expression analysis on 2 condition groups. The data were normalised using the rlog function (DESeq2) (named "Unbatched" in the PCA plot). Subsequently, the normalised data were batch corrected using the removeBatchEffect function (limma) (named "Batchcorrected").
Problem: I have 4 batches, and the condition groups are not evenly distributed across them (batches 3 and 4 each contain only one group, as shown below). I have read that this kind of imbalance may cause incorrect downstream analysis.
batch  Group1  Group2
1      4       2
2      4       2
3      2       0
4      0       3
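To make the imbalance concrete, here is a minimal sketch (plain Python, standard library only) that rebuilds the sample layout from the table above and checks whether an additive design of the form ~ batch + group would be full rank. The column coding (batch 1 and Group1 as reference levels) and the helper names design_rows and matrix_rank are my own assumptions for illustration; this is not output from edgeR or limma.

```python
from fractions import Fraction

# Samples per (batch, group), copied from the table in the post.
counts = {
    (1, "Group1"): 4, (1, "Group2"): 2,
    (2, "Group1"): 4, (2, "Group2"): 2,
    (3, "Group1"): 2, (3, "Group2"): 0,
    (4, "Group1"): 0, (4, "Group2"): 3,
}

def design_rows(counts):
    """One row per sample: intercept, batch2, batch3, batch4, Group2.

    Assumed coding: batch 1 and Group1 are the reference levels.
    """
    rows = []
    for (batch, group), n in sorted(counts.items()):
        row = [1, int(batch == 2), int(batch == 3), int(batch == 4),
               int(group == "Group2")]
        rows.extend([list(row) for _ in range(n)])
    return rows

def matrix_rank(rows):
    """Rank via Gaussian elimination on exact fractions (no rounding)."""
    m = [[Fraction(x) for x in r] for r in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a nonzero entry.
        pivot = next((i for i in range(rank, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for i in range(len(m)):
            if i != rank and m[i][col] != 0:
                factor = m[i][col] / m[rank][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank

X = design_rows(counts)
n_samples = len(X)
n_coef = len(X[0])
rank = matrix_rank(X)

# Batches that contain both groups, i.e. where a within-batch
# Group1 vs Group2 comparison is possible at all.
mixed_batches = [b for b in (1, 2, 3, 4)
                 if counts[(b, "Group1")] > 0 and counts[(b, "Group2")] > 0]

print(f"samples: {n_samples}, coefficients: {n_coef}, rank: {rank}")
print(f"batches containing both groups: {mixed_batches}")
```

If the rank equals the number of coefficients, the group effect can be estimated alongside the batch terms in principle; that alone does not tell me whether the adjustment is advisable, which is what I am asking about.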
I've included 2 PCA plots: 1) a PCA plot using all genes in the data (17,966 genes) and 2) a PCA plot using the 500 genes with the highest variance in the data.
In the PCA plot made using all genes, a batch effect can be seen; however, it is much less apparent when using the top 500 genes. Given the uneven distribution of sample groups across the batches, would you recommend adjusting for batch effects when performing a differential expression analysis in edgeR?
Thank you :)