Hi bioconductorlist,
I am trying to combine microarray data from different
experiments(batches) where some of my batches fall completely in one
condition (normal/cancer). For this I try to use ComBat in the sva
package.
After playing around with my data and ComBat it seemed to me that
ComBat might introduce more signal in my data than there really is.
The example given by ComBat-help (?ComBat) has a batch and condition
setup much like my data, so I will use that to demonstrate.
I will substitute the real data with random data ensuring no batch
effect or condition effect. Then I will do a quick find-
differentially-expressed-genes analysis with limma.
Contradictory to what I would expect I end up with a list of genes.
# start example code:
# Starting off from ComBat-help (?ComBat)
library(sva)
library(bladderbatch)
data(bladderdata)
# Obtain phenotypic data
pheno = pData(bladderEset)
edata = exprs(bladderEset)
batch = pheno$batch
mod = model.matrix(~as.factor(cancer), data=pheno)
# using dummy data with no effects instead of real data
set.seed(100)
edata[,] = rnorm(length(edata))
# Correct for batch using ComBat
combat_edata = ComBat(dat=edata, batch=batch, mod=mod, par.prior=TRUE,
prior.plots=FALSE)
# overview of experimental layout with regards to conditions and
batches
# The "Biopsy" and "Normal" conditions seems to by most unbalanced.
table(pheno[,3:4])
# Using the combat corrected dummy data to search for diff genes
between "Biopsy"and "Normal" with limma
library(limma)
contrast = c("Biopsy-Normal")
fac=factor(pheno$cancer)
design = model.matrix(~0 + fac)
colnames(design)=levels(fac)
cont.matrix = makeContrasts ( contrasts=contrast, levels=design)
# before ComBat
fit <- lmFit(edata, design)
fit2 = contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
tab = topTable(fit2, adjust='fdr', coef=contrast, number=999999)
# top ten genes
tab[1:10,]
# no differetially exressed geens found
print( paste("DUMMY DATA --- diff genes before batchnorm. Genes with
FDR < 0.1: " , sum(tab$adj.P.Val<0.1), sep=""))
# ComBat adjusted data
fit <- lmFit(combat_edata, design)
fit2 = contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
tab = topTable(fit2, adjust='fdr', coef=contrast, number=999999)
# top ten genes
tab[1:10,]
# Many genes within 0.1 FDR. Looks similar to many microarray results.
I would have reported this as a finding.
print( paste("DUMMY DATA --- diff genes after batchnorm. Genes with
FDR < 0.1: " , sum(tab$adj.P.Val<0.1), sep=""))
# end example code
I also tried (not shown) to use the real example data with permuted
pheno data and run it through limma. The result was no diff-genes
before ComBat and a lot after.
And with fewer genes and an even more unbalanced setup I managed to
make the standard hierarchical cluster heatmap before and after Combat
with the samples clustering perfect according to condition
afterwards(all with random start data).
My interpretation of this is that starting with a unbalanced
batch/condition setup like the one in the example (and in my data) and
using ComBat will almost ensure that I end up with a list of
significantly differentially expressed genes regardless of the real
difference.
This seems very wrong and I suspect I am missing something important.
Please help me.
Thanks for your help.
Vegard Nygaard
Bioinformatics Core Facility
Department of Tumor Biology (Montebello)
Institute for Cancer Research
Oslo University Hospital
Phone 0047 22781756
>sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods base
other attached packages:
[1] limma_3.18.5 bladderbatch_1.0.7 Biobase_2.22.0
BiocGenerics_0.8.0 sva_3.8.0 mgcv_1.7-26
nlme_3.1-111
[8] corpcor_1.6.6 BiocInstaller_1.12.0
loaded via a namespace (and not attached):
[1] grid_3.0.2 lattice_0.20-24 Matrix_1.0-14 tools_3.0.2
The (old) sva-package example in the first post will indeed produce unreliable
results. More generally, for unbalanced data sets, there is a problem
with the two-step approach of generating a new batch adjusted dataset and
then disregard the batch info in the subsequent analysis. The
batch-adjusted dataset will still have batch effects (or dependencies),
namely the batch-effect estimation error. And not to account for this in
downstream analysis may in essence be as problematic as disregarding the
original batch effect altogether. Our primary advice for an investigator facing an unbalanced data set with batch effects, would be to account for batch in the statistical analysis. If this is not possible, batch adjustment using outcome as a covariate should only be performed with great caution.
Read our article and the auxiliary information for more.
The example in the sva tutorial has been updated to include the
batch-label in the analysis.
thanks for writing this up! very interesting!