Dear Bioconductor Community,
Im currently analyzing 4 different affymetrix hgu133plus2 datasets, in order to test via machine learning algorithms in R, a 39-gene signature I have already developed in an independent training set, which could discriminate colorectal cancer from control samples. In my latest step, I would like to implement ComBat to correct for batch effect regarding the different dataset-study, and then apply my classifier. In detail, I merged the 4 different datasets (after normalizing separately with frma each)---all having the same affymetrix platform and annotation (also used customCDF annotation), & the same phenotype information i would like to predict in my later step (Status variable below). After merging (as all the datasets have the same affymetrix platform) my code is the following:
total.eset # my merged expressionSet
ExpressionSet (storageMode: environment)
assayData: 19764 features, 142 samples
element names: exprs
protocolData: none
phenoData
rowNames: GSM93789.CEL GSM93920.CEL ... GSM549142.CEL (142 total)
varLabels: Status Dataset
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hgu133plus2hsentrezg
## also my phenotype information
head(pData(total.eset))
Status Dataset
GSM93789.CEL Cancer GSE4107
GSM93920.CEL Cancer GSE4107
GSM93921.CEL Cancer GSE4107
GSM93922.CEL Cancer GSE4107
GSM93923.CEL Cancer GSE4107
GSM93924.CEL Cancer GSE4107
where Status is the binary factorial variable i would like to use for the prediction-later classification step, whereas the variable Dataset is the batch covariate. Moreover, there is also an obvious class inbalance regarding the Status information:
table(pData(total.eset)$Status)
Cancer Normal
99 43
pheno <- pData(total.eset)
batch <- total.eset$Dataset
class(batch)
[1] "factor"
levels(batch)
[1] "GSE4107" "GSE4183" "GSE18088"
[4] "Homogenized.GSE21510"
modcombat <- model.matrix(~1, data=pheno)
combat_data <- ComBat(dat=exprs(total.eset), batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=TRUE)
....Found 4 batches
Adjusting for 0 covariate(s) or covariate level(s)
Standardizing Data across genes
Fitting L/S model and finding priors
Finding parametric adjustments
Adjusting the Data.....
exprs(total.eset) <- combat_data
My main issue-consideration, is than i created an MDS plot both before and after ComBat correction, to inspect the severity of batch effect and the possible improvement after. i used the following code for the MDS plot, both before and after correction:
plotMDS(exprs(total.eset), labels=total.eset$Status, col=as.numeric(as.factor(total.eset$Dataset)))
Assuming my above chunk code is correct, do you believe that ComBat correction improves the situation, or it makes things more "fuzzy" ? Based on the two plots ? Moreover, the fact that all probesets are used, might has a role in this ? (i intend to subset the corrected eSet with the 39 gene signature symbols i mentioned above). Or i can proceed to my next step ? Finally, It is important also to mention that one of the 4 datasets has no normal samples, just cancer samples.
My MDS plots, as also an hclust plot are below (the colors in the plots depict the different study-dataset)
https://www.dropbox.com/s/urdbl3d6qe5bq0a/beforeCoMbat.totaleSet.png?dl=0
https://www.dropbox.com/s/t4crrfv0ygjifpb/after.correction.ComBat.MDSplot.png?dl=0
https://www.dropbox.com/s/5hnrs81t628ylxr/hclust.plot.average.beforeComBat.png?dl=0
https://www.dropbox.com/s/k7v8ttbq27pk1xh/prior.plots.ComBat.png?dl=0 # also the plot produced from ComBat
Any help or suggestions on the above issue would be essential !!