Question: Method for batch correction
3.0 years ago by
AST50
INDIA
AST50 wrote:

Hi,

I have been using ComBat to correct for batch effects in 450k data (~10 batches). Recently, I read an old reply from Dr. Peter Langfelder in which he said that "ComBat should NOT be used before running association testing (lmFit); association testing should be run with batch as a covariate on original data."

I have read the paper by the ComBat authors, in which they say that ComBat performs better than SVD when the sample size is small and comparably when the sample size is large.

Now, my question is: since my sample size is large and I am using limma to test for differential methylation, should I adjust for batch directly in limma as a covariate? (I could also use removeBatchEffect(), but I am not sure about that, as it would again amount to removing the batch effect in a separate step.)

batch <- pheno$batch
BSC <- pheno$BSC_batch
group <- factor(targets$status, levels=c("Control","Case"))
design <- model.matrix(~targets$Age+batch+BSC+group)
fit <- lmFit(Mval, design)

Or should I continue with ComBat to remove the batch effect and limma to adjust for the other biological covariates?

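# first pass: remove the chip batch
MvalC <- ComBat(Mval, batch=batch, mod=NULL, par.prior = TRUE, prior.plots = FALSE)
# note: modcombat is built here but never passed to ComBat (mod=NULL in both calls)
modcombat <- model.matrix(~targets$Age, data=pheno)
# second pass: remove the BSC batch from the already adjusted data
MvalC1 <- ComBat(MvalC, batch=BSC, mod=NULL, par.prior = TRUE, prior.plots = FALSE)
design <- model.matrix(~targets$Age+group)
fit <- lmFit(MvalC1, design)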

(I corrected for chip batch and BSC batch separately, as they were confounded and limma was throwing an error.)

modified 3.0 years ago by Vegard Nygaard110 • written 3.0 years ago by AST50

You've answered your own question with Peter's quote (for anyone who's interested, the original is at C: Limma, blocking batch effect). That you have large sample sizes and are doing differential methylation does not change the correctness of his advice.


The only part of that comment I didn't understand (as he didn't elaborate on it) is this: what difference does it make if I remove the batch effect first and then adjust for covariates in the differential analysis?

Also, I have seen many people use the limma function removeBatchEffect() to remove batch effects prior to any differential analysis. I guess that is not a correct approach either.


It has to do with the true number of residual d.f. and the uncertainty of the model parameters. I believe I gave you a more complete answer at C: Adjustment of covariates in 450k data using Limma, but to sum up: batch removal is not perfect. Like any statistical procedure, it is subject to errors (e.g., from imprecise estimation of the batch effect), and if you don't model those errors during the DE analysis, then your analysis will not be correct. Blocking on the batch effect in the linear model will account for those errors and is the right approach; removing the batch effect before modelling is not.
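
A minimal base R sketch of the residual d.f. point, not taken from the thread: `lm()` stands in for `lmFit()`, and subtracting estimated batch means stands in for a batch-removal step. The simulated data, the balanced 3-batch layout and all variable names are assumptions made for illustration.

```r
set.seed(42)

# Balanced toy design: 3 batches, each with 4 controls and 4 cases.
batch <- factor(rep(c("b1", "b2", "b3"), each = 8))
group <- factor(rep(rep(c("Control", "Case"), each = 4), times = 3),
                levels = c("Control", "Case"))

# One simulated probe: noise + batch shifts + a small group effect.
y <- rnorm(24) + c(0, 1.5, -1)[as.integer(batch)] + 0.5 * (group == "Case")

# One-step: batch is a blocking factor in the model that tests group.
one_step <- lm(y ~ batch + group)

# Two-step: subtract the estimated batch means first, then test group
# in a model that no longer knows batch was estimated from the data.
y_adj <- residuals(lm(y ~ batch)) + mean(y)
two_step <- lm(y_adj ~ group)

res <- data.frame(
  approach    = c("one_step", "two_step"),
  estimate    = c(coef(one_step)[["groupCase"]],
                  coef(two_step)[["groupCase"]]),
  std_error   = c(summary(one_step)$coefficients["groupCase", "Std. Error"],
                  summary(two_step)$coefficients["groupCase", "Std. Error"]),
  residual_df = c(df.residual(one_step), df.residual(two_step))
)
print(res)
```

In this balanced layout both approaches give the same group estimate, but the two-step fit reports 22 residual d.f. instead of 20 and therefore a smaller standard error, because the two d.f. spent on estimating the batch means are never charged to the second model.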

Thanks Aaron, I also suspected that it had to do with the residual d.f. Thanks once again for clarifying.

3.0 years ago by
United States
W. Evan Johnson800 wrote:

My recommendation is to try using ComBat and then also including a batch covariate in your model. This accounts for any residual df problems (mentioned above) and allows you to use the extra benefits of ComBat that are not available by just using a blocking factor (i.e. adjusting for differences in batch variance).

To elaborate further: if there are no variance differences between the batches, adding a blocking factor is just fine, and there will be no difference between using ComBat and using a blocking factor. However, this is not common; in most cases the batch effect impacts both the means and the variances across batches. The greater the differences in variance between batches, the greater the benefit you will see from using ComBat + limma (plus batch covariate) versus limma (plus batch) alone.
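
A rough base R illustration of the mean-versus-variance distinction, under stated assumptions: the data are simulated, and the per-batch location/scale standardisation below is only a simplified stand-in for the idea of a scale adjustment. It is not ComBat's empirical Bayes procedure.

```r
set.seed(7)

# Two batches that differ in both mean and spread (simulated).
batch <- factor(rep(c("b1", "b2"), each = 50))
y <- c(rnorm(50, mean = 0, sd = 1), rnorm(50, mean = 2, sd = 3))

# A blocking factor only absorbs the batch means; the residual
# spread still differs between batches.
blocked_resid <- residuals(lm(y ~ batch))
sd_after_blocking <- tapply(blocked_resid, batch, sd)

# Location/scale standardisation per batch (hypothetical simplification):
# centre each batch, rescale it to a pooled standard deviation.
batch_mean <- ave(y, batch, FUN = mean)
batch_sd   <- ave(y, batch, FUN = sd)
pooled_sd  <- sqrt(mean(tapply(y, batch, var)))
y_scaled   <- (y - batch_mean) / batch_sd * pooled_sd + mean(y)
sd_after_scaling <- tapply(y_scaled, batch, sd)

print(sd_after_blocking)
print(sd_after_scaling)
```

After blocking, the second batch remains far noisier than the first; after the scale step both batches share the pooled standard deviation by construction.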

Thank you Evan. Will try to follow your suggestion to see if there are any significant differences in my results.

3.0 years ago by
Norway
Vegard Nygaard110 wrote:

My colleagues and I spent considerable time looking into problems with attempting to remove the batch effect, which resulted in the article:
http://biostatistics.oxfordjournals.org/content/17/1/29

I do not disagree with the other answers from Evan and Aaron. I will explain using somewhat different words why you should avoid your second alternative.

ComBat, removeBatchEffect and similar methods cannot truly remove the batch effect, nor can they create a batch-effect-free data set. For that to happen, the true batch effect must be known, which is rarely the case. Instead, as in your example, it is estimated from the data, and it is this estimated batch effect that is removed from the data. Thus the difference between the true batch effect and the estimate, i.e. the estimation error, will be present in your adjusted data. This estimation error is in itself a batch effect. So this procedure swaps one batch effect for another, one problem for another quite similar problem. It can be hard to tell to what degree the second batch effect is detrimental, but be aware that it will be there even if you start out with no true batch effect.

So your second analysis alternative is problematic because when you use ComBat you ensure your data has a batch effect (the estimation error). When this batch effect is ignored in your subsequent limma analysis, unreliable p-values may be the result (depending on other circumstances, such as the experimental design).

Your first alternative is the safer, more textbook-like way of doing it.
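
A small null simulation in base R, offered as a sketch and not as the analysis from the linked article. It assumes a deliberately unbalanced two-batch layout, data with no true batch or group effect, and a batch-removal step that subtracts the estimated batch term while keeping the group term (unlike the `mod=NULL` calls in the question). `lm()` stands in for `lmFit()`, and all names are hypothetical.

```r
set.seed(1)

# Unbalanced layout: batch b1 is mostly controls, batch b2 mostly cases.
batch <- factor(rep(c("b1", "b2"), each = 10))
group <- factor(c(rep("Control", 9), "Case", "Control", rep("Case", 9)),
                levels = c("Control", "Case"))

one_null_run <- function() {
  y <- rnorm(20)  # pure noise: no true batch effect, no true group effect

  # One-step: test group with batch as a covariate.
  full <- lm(y ~ batch + group)

  # Two-step: subtract the *estimated* batch term, then test group
  # in a model that ignores batch.
  y_adj <- y - unname(coef(full)["batchb2"]) * (batch == "b2")
  reduced <- lm(y_adj ~ group)

  c(one_step = summary(full)$coefficients["groupCase", "Pr(>|t|)"],
    two_step = summary(reduced)$coefficients["groupCase", "Pr(>|t|)"])
}

pvals <- replicate(500, one_null_run())

# Fraction of null probes called significant at the 5% level.
false_positive_rate <- rowMeans(pvals < 0.05)
print(false_positive_rate)
```

Under these assumptions the one-step model stays near the nominal 5% rate, while the two-step route rejects far more often, even though the simulated data never contained a batch effect to begin with.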

Thank you Vegard. The article surely is very helpful in this regard.

Sorry for the delayed response. I just want to point out that the second alternative DID NOT advocate using ComBat + LIMMA without a batch covariate, which does, as their paper describes, lead to exaggerated significance in studies with unbalanced experimental design.

It DOES advocate using ComBat followed by LIMMA with a batch covariate. Therefore the conditions, results, and concerns of the aforementioned paper do not apply here. The approach suggested above is a perfectly reasonable solution and in many cases outperforms the one step alternative while retaining the appropriate and expected statistical test behaviors.

Hi Evan, piggybacking off this thread since I have the same issue. I want to be clear about what you are suggesting. Say I have an experiment (drug vs. control) that I ran on 3 separate cell lines, and I mainly care about the effect of the drug. Can I run ComBat to correct for cell line and then use the corrected matrix in a model that also includes cell line as a covariate? For example:

1. correct for batch (cell line)
2. design <- model.matrix(~0 + key$group + key$cell)
3. lmFit(corrected.matrix, design)

Would this work?