Question

Is ComBat introducing a signal where there is none?

2

Entering edit mode

Vegard Nygaard ▴ 120

@vegard-nygaard-6284

Last seen 5.9 years ago

Norway

Hi bioconductorlist, I am trying to combine microarray data from different experiments(batches) where some of my batches fall completely in one condition (normal/cancer). For this I try to use ComBat in the sva package. After playing around with my data and ComBat it seemed to me that ComBat might introduce more signal in my data than there really is. The example given by ComBat-help (?ComBat) has a batch and condition setup much like my data, so I will use that to demonstrate. I will substitute the real data with random data ensuring no batch effect or condition effect. Then I will do a quick find- differentially-expressed-genes analysis with limma. Contradictory to what I would expect I end up with a list of genes. # start example code: # Starting off from ComBat-help (?ComBat) library(sva) library(bladderbatch) data(bladderdata) # Obtain phenotypic data pheno = pData(bladderEset) edata = exprs(bladderEset) batch = pheno$batch mod = model.matrix(~as.factor(cancer), data=pheno) # using dummy data with no effects instead of real data set.seed(100) edata[,] = rnorm(length(edata)) # Correct for batch using ComBat combat_edata = ComBat(dat=edata, batch=batch, mod=mod, par.prior=TRUE, prior.plots=FALSE) # overview of experimental layout with regards to conditions and batches # The "Biopsy" and "Normal" conditions seems to by most unbalanced. table(pheno[,3:4]) # Using the combat corrected dummy data to search for diff genes between "Biopsy"and "Normal" with limma library(limma) contrast = c("Biopsy-Normal") fac=factor(pheno$cancer) design = model.matrix(~0 + fac) colnames(design)=levels(fac) cont.matrix = makeContrasts ( contrasts=contrast, levels=design) # before ComBat fit <- lmFit(edata, design) fit2 = contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) tab = topTable(fit2, adjust='fdr', coef=contrast, number=999999) # top ten genes tab[1:10,] # no differetially exressed geens found print( paste("DUMMY DATA --- diff genes before batchnorm. Genes with FDR < 0.1: " , sum(tab$adj.P.Val<0.1), sep="")) # ComBat adjusted data fit <- lmFit(combat_edata, design) fit2 = contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) tab = topTable(fit2, adjust='fdr', coef=contrast, number=999999) # top ten genes tab[1:10,] # Many genes within 0.1 FDR. Looks similar to many microarray results. I would have reported this as a finding. print( paste("DUMMY DATA --- diff genes after batchnorm. Genes with FDR < 0.1: " , sum(tab$adj.P.Val<0.1), sep="")) # end example code I also tried (not shown) to use the real example data with permuted pheno data and run it through limma. The result was no diff-genes before ComBat and a lot after. And with fewer genes and an even more unbalanced setup I managed to make the standard hierarchical cluster heatmap before and after Combat with the samples clustering perfect according to condition afterwards(all with random start data). My interpretation of this is that starting with a unbalanced batch/condition setup like the one in the example (and in my data) and using ComBat will almost ensure that I end up with a list of significantly differentially expressed genes regardless of the real difference. This seems very wrong and I suspect I am missing something important. Please help me. Thanks for your help. Vegard Nygaard Bioinformatics Core Facility Department of Tumor Biology (Montebello) Institute for Cancer Research Oslo University Hospital Phone 0047 22781756 >sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] limma_3.18.5 bladderbatch_1.0.7 Biobase_2.22.0 BiocGenerics_0.8.0 sva_3.8.0 mgcv_1.7-26 nlme_3.1-111 [8] corpcor_1.6.6 BiocInstaller_1.12.0 loaded via a namespace (and not attached): [1] grid_3.0.2 lattice_0.20-24 Matrix_1.0-14 tools_3.0.2

Microarray Clustering Cancer limma Microarray Clustering Cancer limma • 4.2k views

ADD COMMENT • link 12.2 years ago • updated 10.5 years ago Vegard Nygaard ▴ 120

score 4 · Answer 1 · 2015-08-27

Hi again.
I have an update for this post.

I did not get any answers. Since the method was widely used, also
at our institution, we felt forced to figure this out.

Our effort resulted in the recently published article;

"Methods that remove batch effects while retaining group differences may
lead to exaggerated confidence in downstream analyses"
( http://biostatistics.oxfordjournals.org/content/early/2015/08/13/biostatistics.kxv027 ),
Vegard Nygaard, Einar Andreas Rødland, and Eivind Hovig,
Biostat first published online August 13, 2015,
doi:10.1093/biostatistics/kxv027

The (old) sva-package example in the first post will indeed produce unreliable
results. More generally, for unbalanced data sets, there is a problem
with the two-step approach of generating a new batch adjusted dataset and
then disregard the batch info in the subsequent analysis. The
batch-adjusted dataset will still have batch effects (or dependencies),
namely the batch-effect estimation error. And not to account for this in
downstream analysis may in essence be as problematic as disregarding the
original batch effect altogether. Our primary advice for an investigator facing an unbalanced data set with batch effects, would be to account for batch in the statistical analysis. If this is not possible, batch adjustment using outcome as a covariate should only be performed with great caution.

Read our article and the auxiliary information for more.

The example in the sva tutorial has been updated to include the
batch-label in the analysis.

Vegard.