Dear all,
I analyzed a batch of samples using edgeR to identify differentially expressed genes in a bacterial strain. Two experimental factors varied (growth phase and temperature), and I had at least 3 replicates for every combination of conditions. We are mostly interested in the temperature effect, but I also extracted DE genes between growth phases (blocking on temperature).
Everything went fine until I received a second batch of data with additional temperature values: one control condition (temp1, the same as in batch01) as a consistency check, and a new temperature value. QC looks fine (read quality, alignment, etc.), and a quick look in a genome browser showed no sign of excessive noise (see picture for an example). However, samples from the second batch were sequenced much more deeply (bottom, ~25 million reads) than those from the first batch (top, ~7 million reads).
To check for consistency between batches, I generated an MDS plot from normalized CPM values (see figure).
From this plot, we can see that the control-condition samples ('Temp1', common to both batches) do not cluster together across batches (small vs. large points) for 'G2'. I suspect a pronounced batch effect.
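For reference, the check was done roughly along these lines (object names are placeholders for my actual sample annotation):

```r
library(edgeR)
# Placeholder objects: a counts matrix 'counts' and per-sample factors
# 'batch' and 'group' describing each library.
y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)           # TMM normalization factors
logcpm <- cpm(y, log = TRUE)      # log2-CPM with a small prior count
plotMDS(logcpm,
        col = as.integer(batch),  # color points by batch
        pch = as.integer(group))  # plotting symbol by condition
```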
If you confirm that this looks like a batch-effect problem, what would be the best strategy for correcting it (if possible) before the edgeR analysis?
Is it a problem to have only a single temperature in common between the batches?
I read about limma's removeBatchEffect() function, but I would like confirmation that I can use it in this context. In fact, I suspect it would preclude any further use of my edgeR pipeline, because it transforms counts into continuous data.
I also found references to batch-effect modeling with voom; would that be more appropriate than an edgeR approach in this context?
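For what it is worth, what I had in mind with removeBatchEffect() was something like the sketch below — my understanding is that it operates on log-expression values such as log-CPM, not on raw counts (object names are placeholders):

```r
library(edgeR)  # for cpm()
library(limma)  # for removeBatchEffect()
# Placeholder objects: a DGEList 'y' and a per-sample factor 'batch'.
logcpm <- cpm(y, log = TRUE)
corrected <- removeBatchEffect(logcpm, batch = batch)
plotMDS(corrected, col = as.integer(batch))  # re-check clustering
```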
Thank you very much for your help.
R.
Thank you very much for the clear answer.
For convenience (automated analyses), I tend to use edgeR in a slightly unusual way. The strategy consists of creating a generic design matrix (each combination of conditions gets its own column). Then, at the analysis stage, the user can query the fitted model with the contrasts of interest.
Here is an example of such a "generic" design matrix (from a simpler case than above):
G1_Temp1 G1_Temp2 G2_Temp1 G2_Temp2
1 0 0 0
1 0 0 0
1 0 0 0
0 1 0 0
0 1 0 0
0 1 0 0
0 1 0 0
0 0 1 0
0 0 1 0
0 0 1 0
0 0 0 1
0 0 0 1
0 0 0 1
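Such a matrix can be generated along these lines (the replicate counts mirror the rows above):

```r
# One column per condition combination, via a means-model design.
group <- factor(rep(c("G1_Temp1", "G1_Temp2", "G2_Temp1", "G2_Temp2"),
                    times = c(3, 4, 3, 3)))
design <- model.matrix(~0 + group)   # no intercept: one column per group
colnames(design) <- levels(group)
```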
I thought about "blocking" at the contrast stage using the following strategy.
Example of contrasts for temp1 VS temp2 with blocking of G1/G2:
G1_Temp1 G1_Temp2 G2_Temp1 G2_Temp2
1 1 -1 -1
The idea being that any differences between G1 and G2 would be absorbed into the variance estimation.
Example of contrasts for temp1 VS temp2 without blocking of G1/G2:
G1_Temp1 G1_Temp2 G2_Temp1 G2_Temp2
1 -1 0 0
0 0 1 -1
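Concretely, I would pass these rows as numeric contrast vectors to the fitted model, something like (placeholder objects, and assuming the design above):

```r
library(edgeR)
# Placeholder objects: DGEList 'y' and the 'design' matrix shown above,
# whose columns are G1_Temp1, G1_Temp2, G2_Temp1, G2_Temp2.
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res_blocked <- glmQLFTest(fit, contrast = c(1,  1, -1, -1))
res_G1      <- glmQLFTest(fit, contrast = c(1, -1,  0,  0))
res_G2      <- glmQLFTest(fit, contrast = c(0,  0,  1, -1))
```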
Does that sound correct, and can I use the same strategy to correct for the batch-effect bias in my example above (i.e., combining batch1 and batch2 samples in the contrasts)?
Your simplified example and the associated contrasts are not relevant to your actual problem. Remember that you are trying to get rid of the batch effect. However, your design matrix doesn't even have a term describing the batch of origin of each sample. This means that, no matter what contrasts you set up, you will not be able to remove the batch effect, because the model doesn't know about it in the first place!
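A minimal sketch of a design that does carry a batch term, assuming per-sample factors `batch` and `group` (placeholder names for your annotation):

```r
library(edgeR)
# Additive model: batch enters the design, so contrasts between the
# group columns are automatically adjusted for batch.
design <- model.matrix(~0 + group + batch)
colnames(design) <- gsub("group", "", colnames(design))
y <- estimateDisp(y, design)     # 'y' is a placeholder DGeList... DGEList
fit <- glmQLFit(y, design)
```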
Don't fight the system - just follow the instructions in the user's guide. I would also add that automatic construction of design matrices is quite difficult beyond the simplest experimental designs. It's not just the formulation of the model.matrix call; it's also the problem of ensuring that the interpretation of each coefficient is correct, and of checking that the matrix is of full rank. It's hard enough that my pipelines would start from a user-supplied design matrix rather than trying to build it automatically.

I realized my comment was getting off topic, so I created a new post for this specific question.
Thank you for your help.