In RNA-seq experiments, batch effects can be very strong. Here are two situations I have observed often:
1. The same cell line is used to replicate an experiment, and the results of the two experiments are quite different (see case1 plot).
2. Two groups of patients (in this case three and two) are sampled at two different points in time, and they cluster according to sampling date.
I suspect that sequencing differs considerably between runs. We have seen this effect when sequencing bacterial genomes too. From a collection of 96 samples prepared simultaneously, we sequenced two sets of 48 in separate runs. The two sets show quite different profiles, and the samples cluster exactly according to sequencing batch.
What should I do in DESeq2? Should the data be analyzed together, including the factor "batch" in the model? Something like:
ddsMat <- DESeqDataSetFromMatrix(countData = countdata, colData = sampleInfo, design = ~ group + batch)
OR, should both batches be analyzed separately?
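Whether "batch" can go in the model at all depends on whether it is confounded with "group". Below is a minimal base R sketch of that check. The sample sheets are made-up values for illustration, not my real data; only the column names group and batch match the design above. I put batch first in the formula because DESeq2's results() reports the last variable of the design by default.

```r
# Hypothetical sample sheet (made-up values): each group has samples in both batches.
sampleInfo <- data.frame(
  group = factor(c("A", "A", "B", "B", "A", "A", "B", "B")),
  batch = factor(c("1", "1", "1", "1", "2", "2", "2", "2"))
)

# Cross-tabulate group against batch; zeros mark combinations with no samples.
print(table(group = sampleInfo$group, batch = sampleInfo$batch))

# Build the design matrix implied by the formula and compare rank to column count.
design <- model.matrix(~ batch + group, data = sampleInfo)
full_rank <- qr(design)$rank == ncol(design)
print(full_rank)       # TRUE: the group effect can be estimated alongside batch

# Hypothetical fully confounded layout: every group was sampled in a single batch.
confounded <- data.frame(
  group = factor(c("A", "A", "A", "B", "B")),
  batch = factor(c("1", "1", "1", "2", "2"))
)
design_conf <- model.matrix(~ batch + group, data = confounded)
conf_full_rank <- qr(design_conf)$rank == ncol(design_conf)
print(conf_full_rank)  # FALSE: batch and group cannot be separated
```

If the second situation applies, DESeq2 stops with a "model matrix is not full rank" error, and analyzing the batches separately does not recover the group effect either, since each batch then contains only one group.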
Links to the figures mentioned above follow: