I analyzed a batch of samples with edgeR to identify differentially expressed (DE) genes in a bacterial strain. Two experimental factors varied (growth phase and temperature), and I had at least 3 replicates for every combination of conditions. We are mostly interested in the temperature effect, but I also extracted DE genes between growth phases (blocking on temperature).
Everything went fine until I received a second batch of data with additional temperature values: one control condition (temp1, the same as in batch01) as a consistency check, and one new temperature value. QC looks fine (read quality, alignment, etc.), and a quick look in a genome browser showed no sign of excessive noise (see picture for an example). However, samples from the second batch were sequenced much more deeply (bottom, ~25 million reads) than those from the first batch (top, ~7 million reads).
To check for consistency between batches, I generated an MDS plot from normalized CPM values (see figure).
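For context on why I doubt that sequencing depth alone explains the separation: CPM rescales each count by its library size, so a gene expressed at the same level should get a similar value at ~7 and ~25 million reads. Below is a minimal Python sketch of plain CPM with made-up counts and gene names. It leaves out the TMM normalization factors that edgeR can apply, so it only illustrates the scaling and is not what my pipeline computes.

```python
def cpm(counts, library_size):
    """Plain counts-per-million: scale each raw count by the library size."""
    return {gene: count * 1e6 / library_size for gene, count in counts.items()}

# Hypothetical raw counts for the same two genes in a shallow and a deep library.
shallow_counts = {"geneA": 70, "geneB": 700}    # library of ~7 million reads
deep_counts = {"geneA": 250, "geneB": 2500}     # library of ~25 million reads

shallow_cpm = cpm(shallow_counts, 7_000_000)
deep_cpm = cpm(deep_counts, 25_000_000)

for gene in shallow_counts:
    print(gene, shallow_cpm[gene], deep_cpm[gene])
```

In this toy case the raw counts differ more than threefold between libraries, but the CPM values agree, which is why I would not expect depth by itself to split the 'Temp1' samples on the plot.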
In this plot, the samples for the control condition ('Temp1', common to both batches) do not cluster together across batches (small vs. large points) for 'G2'. I suspect a pronounced batch effect.
If you agree that this looks like a batch effect, what would be the best strategy for correcting it (if that is possible) before the edgeR analysis?
Is it a problem to have only a single temperature in common between batches?
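My understanding is that the shared temperature is what makes a batch term estimable at all in an additive model (intercept + batch + temperature). The Python sketch below checks this on a hypothetical layout: batch1 with temp1 and temp2, batch2 with temp1 and temp3, 3 replicates each. The labels and the layout are assumptions for illustration, not my exact design. It compares the rank of the design matrix with and without the shared temp1 samples in batch2.

```python
from fractions import Fraction

TEMPS = ("temp1", "temp2", "temp3")  # temp1 is the reference level (assumed labels)

def design_row(batch, temp):
    """One row of an additive design: intercept, batch2 indicator, temp2, temp3."""
    return [1, int(batch == "batch2")] + [int(temp == t) for t in TEMPS[1:]]

def matrix_rank(rows):
    """Rank by Gauss-Jordan elimination, using exact fractions to avoid rounding."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column: it depends on earlier columns
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

REPLICATES = 3
# Layout with temp1 present in both batches.
shared = [("batch1", "temp1"), ("batch1", "temp2"),
          ("batch2", "temp1"), ("batch2", "temp3")] * REPLICATES
# Same layout but without any temp1 samples in batch2.
unshared = [("batch1", "temp1"), ("batch1", "temp2"),
            ("batch2", "temp3")] * REPLICATES

n_coefficients = len(design_row("batch1", "temp1"))
rank_shared = matrix_rank([design_row(b, t) for b, t in shared])
rank_unshared = matrix_rank([design_row(b, t) for b, t in unshared])

print("coefficients:", n_coefficients)
print("rank with shared temp1:", rank_shared)
print("rank without shared temp1:", rank_unshared)
```

With the shared condition the matrix has full column rank (4 of 4 coefficients); without it the rank drops to 3, meaning batch and temp3 cannot be separated. Please correct me if this reasoning does not carry over to the edgeR GLM.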
I read about limma's "removeBatchEffect()" function, but I would like confirmation that I can use it in this context. In fact, I suspect that it would rule out my edgeR pipeline, because it transforms the counts into continuous data.
I also found references to batch-effect modeling with voom; would that be more appropriate than an edgeR approach in this context?
Thank you very much for your help.