Dear Bioconductor Community,
Im currently analyzing 4 different affymetrix hgu133plus2 datasets, in order to test via machine learning algorithms in R, a 39-gene signature I have already developed in an independent training set, which could discriminate colorectal cancer from control samples. In my latest step, I would like to implement ComBat to correct for batch effect regarding the different dataset-study, and then apply my classifier. In detail, I merged the 4 different datasets (after normalizing separately with frma each)---all having the same affymetrix platform and annotation (also used customCDF annotation), & the same phenotype information i would like to predict in my later step (Status variable below). After merging (as all the datasets have the same affymetrix platform) my code is the following:
total.eset # my merged expressionSet
ExpressionSet (storageMode: environment)
assayData: 19764 features, 142 samples
element names: exprs
rowNames: GSM93789.CEL GSM93920.CEL ... GSM549142.CEL (142 total)
varLabels: Status Dataset
experimentData: use 'experimentData(object)'
## also my phenotype information
GSM93789.CEL Cancer GSE4107
GSM93920.CEL Cancer GSE4107
GSM93921.CEL Cancer GSE4107
GSM93922.CEL Cancer GSE4107
GSM93923.CEL Cancer GSE4107
GSM93924.CEL Cancer GSE4107
where Status is the binary factorial variable i would like to use for the prediction-later classification step, whereas the variable Dataset is the batch covariate. Moreover, there is also an obvious class inbalance regarding the Status information:
pheno <- pData(total.eset)
batch <- total.eset$Dataset
 "GSE4107" "GSE4183" "GSE18088"
modcombat <- model.matrix(~1, data=pheno)
combat_data <- ComBat(dat=exprs(total.eset), batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=TRUE)
....Found 4 batches
Adjusting for 0 covariate(s) or covariate level(s)
Standardizing Data across genes
Fitting L/S model and finding priors
Finding parametric adjustments
Adjusting the Data.....
exprs(total.eset) <- combat_data
My main issue-consideration, is than i created an MDS plot both before and after ComBat correction, to inspect the severity of batch effect and the possible improvement after. i used the following code for the MDS plot, both before and after correction:
plotMDS(exprs(total.eset), labels=total.eset$Status, col=as.numeric(as.factor(total.eset$Dataset)))
Assuming my above chunk code is correct, do you believe that ComBat correction improves the situation, or it makes things more "fuzzy" ? Based on the two plots ? Moreover, the fact that all probesets are used, might has a role in this ? (i intend to subset the corrected eSet with the 39 gene signature symbols i mentioned above). Or i can proceed to my next step ? Finally, It is important also to mention that one of the 4 datasets has no normal samples, just cancer samples.
My MDS plots, as also an hclust plot are below (the colors in the plots depict the different study-dataset)
https://www.dropbox.com/s/k7v8ttbq27pk1xh/prior.plots.ComBat.png?dl=0 # also the plot produced from ComBat
Any help or suggestions on the above issue would be essential !!