Question

Inspection of batch effect correction success with ComBat in R, regarding a merged dataset comprised of 4 different batches

0

Entering edit mode

svlachavas ▴ 840

@svlachavas-7225

Last seen 2 days ago

Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor Community,

Im currently analyzing 4 different affymetrix hgu133plus2 datasets, in order to test via machine learning algorithms in R, a 39-gene signature I have already developed in an independent training set, which could discriminate colorectal cancer from control samples. In my latest step, I would like to implement ComBat to correct for batch effect regarding the different dataset-study, and then apply my classifier. In detail, I merged the 4 different datasets (after normalizing separately with frma each)---all having the same affymetrix platform and annotation (also used customCDF annotation), & the same phenotype information i would like to predict in my later step (Status variable below). After merging (as all the datasets have the same affymetrix platform) my code is the following:

total.eset # my merged expressionSet ExpressionSet (storageMode: environment) assayData: 19764 features, 142 samples element names: exprs protocolData: none phenoData rowNames: GSM93789.CEL GSM93920.CEL ... GSM549142.CEL (142 total) varLabels: Status Dataset varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation: hgu133plus2hsentrezg

## also my phenotype information

head(pData(total.eset))
Status Dataset
GSM93789.CEL Cancer GSE4107
GSM93920.CEL Cancer GSE4107
GSM93921.CEL Cancer GSE4107
GSM93922.CEL Cancer GSE4107
GSM93923.CEL Cancer GSE4107
GSM93924.CEL Cancer GSE4107

where Status is the binary factorial variable i would like to use for the prediction-later classification step, whereas the variable Dataset is the batch covariate. Moreover, there is also an obvious class inbalance regarding the Status information:

table(pData(total.eset)$Status)

Cancer Normal 99 43

pheno <- pData(total.eset) batch <- total.eset$Dataset

class(batch) [1] "factor"

levels(batch) [1] "GSE4107" "GSE4183" "GSE18088" [4] "Homogenized.GSE21510"

modcombat <- model.matrix(~1, data=pheno) combat_data <- ComBat(dat=exprs(total.eset), batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=TRUE)

....Found 4 batches
Adjusting for 0 covariate(s) or covariate level(s)
Standardizing Data across genes
Fitting L/S model and finding priors
Finding parametric adjustments
Adjusting the Data.....

exprs(total.eset) <- combat_data

My main issue-consideration, is than i created an MDS plot both before and after ComBat correction, to inspect the severity of batch effect and the possible improvement after. i used the following code for the MDS plot, both before and after correction:

plotMDS(exprs(total.eset), labels=total.eset$Status, col=as.numeric(as.factor(total.eset$Dataset)))

Assuming my above chunk code is correct, do you believe that ComBat correction improves the situation, or it makes things more "fuzzy" ? Based on the two plots ? Moreover, the fact that all probesets are used, might has a role in this ? (i intend to subset the corrected eSet with the 39 gene signature symbols i mentioned above). Or i can proceed to my next step ? Finally, It is important also to mention that one of the 4 datasets has no normal samples, just cancer samples.

My MDS plots, as also an hclust plot are below (the colors in the plots depict the different study-dataset)

https://www.dropbox.com/s/urdbl3d6qe5bq0a/beforeCoMbat.totaleSet.png?dl=0

https://www.dropbox.com/s/t4crrfv0ygjifpb/after.correction.ComBat.MDSplot.png?dl=0

https://www.dropbox.com/s/5hnrs81t628ylxr/hclust.plot.average.beforeComBat.png?dl=0

https://www.dropbox.com/s/k7v8ttbq27pk1xh/prior.plots.ComBat.png?dl=0 # also the plot produced from ComBat

Any help or suggestions on the above issue would be essential !!

affymetrix microarrays sva combat sva batch effect plotMDS • 1.9k views

ADD COMMENT • link updated 8.6 years ago by Jeff Leek ▴ 650 • written 8.6 years ago by svlachavas ▴ 840

score 1 · Answer 1 · 2016-08-31

1

Entering edit mode

Jeff Leek ▴ 650

@jeff-leek-5015

Last seen 4.1 years ago

United States

Hi

It looks like Combat may have removed some of the batch effect, but it is pretty hard to detect from these plots. You might consider checking out the BatchQC package:

http://watson.nci.nih.gov/bioc_mirror/packages/release/bioc/html/BatchQC.html

To evaluate your batch correction.

Best

Jeff

ADD COMMENT • link 8.6 years ago Jeff Leek ▴ 650