Inspection of batch effect correction success with ComBat in R, regarding a merged dataset comprised of 4 different batches
Entering edit mode
svlachavas ▴ 800
Last seen 5 days ago
Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor Community,

Im currently analyzing 4 different affymetrix hgu133plus2 datasets, in order to test via machine learning algorithms in R, a 39-gene signature I have already developed in an independent training set, which could discriminate colorectal cancer from control samples. In my latest step, I would like to implement ComBat to correct for batch effect regarding the different dataset-study, and then apply my classifier. In detail, I merged the 4 different datasets (after normalizing separately with frma each)---all having the same affymetrix platform and annotation (also used customCDF annotation), & the same phenotype information i would like to predict in my later step (Status variable below). After merging (as all the datasets have the same affymetrix platform) my code is the following:

total.eset # my merged expressionSet
ExpressionSet (storageMode: environment)
assayData: 19764 features, 142 samples 
  element names: exprs 
protocolData: none
  rowNames: GSM93789.CEL GSM93920.CEL ... GSM549142.CEL (142 total)
  varLabels: Status Dataset
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hgu133plus2hsentrezg 

## also my phenotype information

                         Status Dataset
GSM93789.CEL Cancer GSE4107
GSM93920.CEL Cancer GSE4107
GSM93921.CEL Cancer GSE4107
GSM93922.CEL Cancer GSE4107
GSM93923.CEL Cancer GSE4107
GSM93924.CEL Cancer GSE4107

where Status is the binary factorial variable i would like to use for the prediction-later classification step, whereas the variable Dataset is the batch covariate. Moreover, there is also an obvious class inbalance regarding the Status information:


Cancer Normal 
    99     43 

pheno <- pData(total.eset)
batch <- total.eset$Dataset

[1] "factor"

[1] "GSE4107"              "GSE4183"              "GSE18088"            
[4] "Homogenized.GSE21510"

modcombat <- model.matrix(~1, data=pheno)
combat_data <- ComBat(dat=exprs(total.eset), batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=TRUE) 

....Found 4 batches
Adjusting for 0 covariate(s) or covariate level(s)
Standardizing Data across genes
Fitting L/S model and finding priors
Finding parametric adjustments
Adjusting the Data.....

exprs(total.eset) <- combat_data

My main issue-consideration, is than i created an MDS plot both before and after ComBat correction, to inspect the severity of batch effect and the possible improvement after. i used the following code for the MDS plot, both before and after correction:

plotMDS(exprs(total.eset), labels=total.eset$Status, col=as.numeric(as.factor(total.eset$Dataset)))

Assuming my above chunk code is correct, do you believe that ComBat correction improves the situation, or it makes things more "fuzzy" ? Based on the two plots ? Moreover, the fact that all probesets are used, might has a role in this ? (i intend to subset the corrected eSet with the 39 gene signature symbols i mentioned above). Or i can proceed to my next step ? Finally, It is important also to mention that one of the 4 datasets has no normal samples, just cancer samples.

My MDS plots, as also an hclust plot are below (the colors in the plots depict the different study-dataset) # also the plot produced from ComBat

Any help or suggestions on the above issue would be essential !!


affymetrix microarrays sva combat sva batch effect plotMDS • 1.1k views
Entering edit mode
Jeff Leek ▴ 640
Last seen 19 months ago
United States


It looks like Combat may have removed some of the batch effect, but it is pretty hard to detect from these plots. You might consider checking out the BatchQC package:

To evaluate your batch correction. 





Login before adding your answer.

Traffic: 365 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6