Hi everyone!
I am working with microarray gene expression data to build a centroid-based classifier. This is my workflow:
1) To get a valid sample size, I merged multiple datasets from several platforms. A PCA plot revealed an obvious dataset effect. After checking that my variable of interest was balanced across datasets (as recommended in http://biostatistics.oxfordjournals.org/content/17/1/29), I decided to use ComBat for batch effect correction. Judging by the PCA after ComBat, the correction worked well.
2) Then I randomly split the data into training and test sets and looked for differentially expressed genes in the training set using limma (with the model.matrix including my variable of interest plus batch info, as recommended in the answers "A: sva::ComBat without covariate of interest?" and "A: Method for batch correction").
3) The resulting genes were used to build a centroid classifier with the pamr package (http://www.bioconductor.org/packages//2.7/bioc/manuals/pamr/man/pamr.pdf).
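For readers unfamiliar with centroid classifiers, the idea behind step 3) can be sketched roughly as follows. This is only a toy illustration in plain Python, not the pamr implementation: pamr additionally shrinks each class centroid toward the overall centroid, which is left out here, and the function names, class labels and expression values are all made up.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def fit_centroids(samples, labels):
    """Return {label: per-gene mean expression} computed from training samples."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, sample):
    """Assign the sample to the class whose centroid is nearest."""
    return min(centroids, key=lambda y: dist(centroids[y], sample))


# Hypothetical toy data: 4 training samples, 2 selected genes each.
train_x = [[1.0, 5.0], [1.2, 4.8], [3.0, 1.0], [3.2, 1.2]]
train_y = ["control", "control", "case", "case"]

centroids = fit_centroids(train_x, train_y)
print(predict(centroids, [2.9, 1.5]))  # new sample, label unknown
```

Note that prediction only needs the expression values of the new sample and the stored centroids, which is why the new sample has to be on the same scale as the batch-corrected training data.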
#MY PROBLEM IS…
When using ComBat, you can either specify a covariate (i.e. your variable of interest) or not.
If I run ComBat in step 1) with my variable of interest specified as a covariate, as recommended by ComBat’s authors, the classifier performs very well on the test set, with an acceptable number of false positives and false negatives.
However, if I run ComBat without adjusting for any covariates, the classifier sucks.
The problem is that in a "real-world sample" my variable of interest will obviously be unknown, since it is what I want to predict with my classifier, so I won’t be able to run ComBat with that variable as a covariate.
So, I don’t know what to do.
Any advice?
Thank you in advance!
Juan