Dear all,
I analyzed a batch of samples using edgeR to identify differentially expressed genes in a bacterial strain. Two experimental factors varied (growth phase and temperature), and I had at least 3 replicates for every combination of conditions. We are mostly interested in the temperature effect, but I also extracted DE genes between growth phases (blocking on temperature).
Everything went fine until I received a second batch of data with additional temperature values: one control condition (temp1, the same as in batch01) as a consistency check, and a new temperature value. QC looks fine (read quality, alignment, etc.), and a quick look in a genome browser showed no sign of excessive noise (see picture for an example). However, samples from the second batch were sequenced much more deeply (bottom, ~25 million reads) than those from the first batch (top, ~7 million reads).
To check for consistency between batches, I generated an MDS plot from normalized CPM values (see figure).
From this plot, we can see that the control-condition samples ('Temp1', common to both batches) do not cluster together across batches (small vs. large points) for 'G2'. I suspect a pronounced batch effect.
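For reference, the check was done roughly along these lines (object names are placeholders for my actual sample annotation):

```r
library(edgeR)
# Placeholder objects: a counts matrix 'counts' and per-sample factors
# 'batch' and 'group' describing each library.
y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)           # TMM normalization factors
logcpm <- cpm(y, log = TRUE)      # log2-CPM with a small prior count
plotMDS(logcpm,
        col = as.integer(batch),  # color points by batch
        pch = as.integer(group))  # plotting symbol by condition
```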
If you confirm that this looks like a batch-effect problem, what would be the best strategy for correcting it (if possible) before the edgeR analysis?
Is it a problem to have only a single temperature in common between the batches?
I read about limma's removeBatchEffect() function, but I would like confirmation that I can use it in this context. In fact, I suspect it would preclude any further use of my edgeR pipeline, because it transforms counts into continuous data.
I also found references to batch-effect modeling with voom; would that be more appropriate than an edgeR approach in this context?
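For what it is worth, what I had in mind with removeBatchEffect() was something like the sketch below — my understanding is that it operates on log-expression values such as log-CPM, not on raw counts (object names are placeholders):

```r
library(edgeR)  # for cpm()
library(limma)  # for removeBatchEffect()
# Placeholder objects: a DGEList 'y' and a per-sample factor 'batch'.
logcpm <- cpm(y, log = TRUE)
corrected <- removeBatchEffect(logcpm, batch = batch)
plotMDS(corrected, col = as.integer(batch))  # re-check clustering
```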
Thank you very much for your help.
R.
Thank you very much for the clear answer.
For convenience (automated analyses), I tend to use edgeR in a slightly unusual way. The strategy consists of creating a generic design matrix (each combination of conditions gets its own column). Then, at the analysis stage, the user can query the fitted model with the contrasts of interest.
Here is an example of such a "generic" design matrix (from a simpler case than above):
G1_Temp1 G1_Temp2 G2_Temp1 G2_Temp2
1 0 0 0
1 0 0 0
1 0 0 0
0 1 0 0
0 1 0 0
0 1 0 0
0 1 0 0
0 0 1 0
0 0 1 0
0 0 1 0
0 0 0 1
0 0 0 1
0 0 0 1
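Such a matrix can be generated along these lines (the replicate counts mirror the rows above):

```r
# One column per condition combination, via a means-model design.
group <- factor(rep(c("G1_Temp1", "G1_Temp2", "G2_Temp1", "G2_Temp2"),
                    times = c(3, 4, 3, 3)))
design <- model.matrix(~0 + group)   # no intercept: one column per group
colnames(design) <- levels(group)
```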
I thought about "blocking" at the contrast stage using the following strategy.
Example of contrasts for temp1 VS temp2 with blocking of G1/G2:
G1_Temp1 G1_Temp2 G2_Temp1 G2_Temp2
1 1 -1 -1
The idea being that any differences between G1 and G2 would be absorbed into the variance estimation.
Example of contrasts for temp1 VS temp2 without blocking of G1/G2:
G1_Temp1 G1_Temp2 G2_Temp1 G2_Temp2
1 -1 0 0
0 0 1 -1
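Concretely, I would pass these rows as numeric contrast vectors to the fitted model, something like (placeholder objects, and assuming the design above):

```r
library(edgeR)
# Placeholder objects: DGEList 'y' and the 'design' matrix shown above,
# whose columns are G1_Temp1, G1_Temp2, G2_Temp1, G2_Temp2.
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res_blocked <- glmQLFTest(fit, contrast = c(1,  1, -1, -1))
res_G1      <- glmQLFTest(fit, contrast = c(1, -1,  0,  0))
res_G2      <- glmQLFTest(fit, contrast = c(0,  0,  1, -1))
```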
Does that sound correct, and can I use the same strategy to correct for the batch-effect bias in my example above (i.e., combining batch1 and batch2 samples in the contrasts)?
Your simplified example and the associated contrasts are not relevant to your actual problem. Remember that you are trying to get rid of the batch effect. However, your design matrix doesn't even have a term describing the batch of origin of each sample. This means that, no matter what contrasts you set up, you will not be able to remove the batch effect, because the model doesn't know about it in the first place!
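A minimal sketch of a design that does carry a batch term, assuming per-sample factors `batch` and `group` (placeholder names for your annotation):

```r
library(edgeR)
# Additive model: batch enters the design, so contrasts between the
# group columns are automatically adjusted for batch.
design <- model.matrix(~0 + group + batch)
colnames(design) <- gsub("group", "", colnames(design))
y <- estimateDisp(y, design)     # 'y' is a placeholder DGeList... DGEList
fit <- glmQLFit(y, design)
```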
Don't fight the system - just follow the instructions in the user's guide. I would also add that automatic construction of design matrices is quite difficult beyond the simplest experimental designs. It's not just the formulation of the model.matrix call; it's also the problem of ensuring that the interpretation of each coefficient is correct, and of checking that the matrix is of full rank. It's hard enough that my pipelines would start from a user-supplied design matrix rather than trying to build it automatically.

I realized my comment was getting off topic, so I created a new post for this specific question.
Thank you for your help.