Hi everyone,

I have some microarray data files containing two sets of samples in normal and disease condition. I have tested that the data also contain significant batch effects with hybridization time. However, the positive hits I obtained using the following approaches are very different (using the same cutoff value in decideTests function). I think I am supposed to use the first approach but I am surprised to see a big difference between the two approaches. could anyone help figure out the reasons?

Thanks,

Zhi Xie

NIH/NHLBI

Here eset is the expression dataset after RMA function.

**Approach 1**:

# Consider batch effects in the model matrix design<-model.matrix(~0+condition.factor+batch.factor) # fit the linear model fit<-lmFit(eset,design)

Then I create contrast matrix and compute coefficients and errors using contrast.fit function

**Approach 2**:

# remove batch effects first exp.eset.rm.batch<-removeBatchEffect(exprs(eset),batch.factor) # only consider normal and disease conditions in the model matrix design<-model.matrix(~0+condition.factor) # fit the linear model fit<-lmFit(eset.rm.batch,design)

where eset.rm.batch is the expression dataset containing expression values from exp.eset.rm.batch table

Dear Dr. Smyth,

As Mike pointed out, here is the NOTE on help page of removeBatchEffect: "

This function is not intended to be used prior to linear modelling. For linear modelling, it is better to include the batch factors in the linear model.".In the example above (Answer to Xie Zhi), which you named

Approach 3, you have actually used removeBatchEffect functionbeforelinear modelling. Not only that, I have been following your answers to numerous questions on BioC and elsewhere where you have equivocally stated that this function should only be considered for plots like PCA, MDS etc and not for DGE. I am sure many including myself would be in dilemma at this moment. Could you please elaborate and clear the confusion?Thank you

Sandip

Yes, I wrote the NOTE, and it still seems to me to give you perfectly clear advice.

I already told you in my answer above that Approach 2, which does not use removeBatchCorrect, is correct. In Approach 3, the removeBatchEffect() step is superfluous, but doesn't do any harm because the batch factor is included in the linear model as well, which is what the NOTE on the help page tells you to do. I included the 3rd approach in my answer just to point out that it gives the same results as Approach 1 for the conditions of interest, hoping that would have pedagogic value. Obviously one wouldn't use Approach 3 is practice because it is equivalent to Approach 1 but includes an extra unnecessary step, as well as making it harder to assess the importance of the batch adjustment.

That's what I thought. Thanks for clarification and that too so promptly.

Sandip

Hi, I am a little late to this conversation. Can you build a design matrix like the removeBatchEffect does and implement into the lmfit for DE analysis

contrasts(batch) <- contr.sum(levels(batch))

batch <- model.matrix(~batch)[, -1, drop = FALSE]

design <- model.matrix(~0+treatment)

design_matrix <- cbind(design, batch)

fit <- lmfit(matrix, design_matrix)

or would it just be more acceptable to do it the standard way

design <- model.matrix(~0+treatment+batch)

fit <- lmfit(matrix, design)

Thanks for any feedback! Leanne