Hi! I'm working with a large methylation dataset generated on Illumina 450k arrays. I intend to run a set of linear models on highly variable CpGs, including both phenotypic and genotypic factors as predictors. The models will then be compared using a metric such as AIC or adjusted R² to identify which predictors best explain my response variable.
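To make the planned comparison concrete, here is a minimal sketch of how AIC and adjusted R² could be computed for a few candidate models on a single CpG. Everything in it is an assumption for illustration: the data are simulated, and the names `genotype`, `age`, `fit_ols`, `aic_gaussian` and `adjusted_r2` are hypothetical, not part of my actual pipeline.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit; X must already contain an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return beta, rss

def aic_gaussian(rss, n, n_coef):
    """AIC for a Gaussian linear model; k counts coefficients plus the error variance."""
    k = n_coef + 1
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return 2 * k - 2 * loglik

def adjusted_r2(rss, y, n_coef):
    """Adjusted R^2; n_coef includes the intercept."""
    n = len(y)
    tss = float(np.sum((y - y.mean()) ** 2))
    return 1 - (rss / (n - n_coef)) / (tss / (n - 1))

# Simulated data (assumption): one CpG, one SNP dosage, one phenotype
rng = np.random.default_rng(0)
n = 200
genotype = rng.integers(0, 3, size=n).astype(float)
age = rng.normal(50, 10, size=n)
y = 0.3 * genotype + rng.normal(0, 0.5, size=n)  # only genotype has a true effect

ones = np.ones(n)
models = {
    "genotype": np.column_stack([ones, genotype]),
    "age": np.column_stack([ones, age]),
    "genotype+age": np.column_stack([ones, genotype, age]),
}

results = {}
for name, X in models.items():
    _, rss = fit_ols(X, y)
    results[name] = (aic_gaussian(rss, n, X.shape[1]), adjusted_r2(rss, y, X.shape[1]))
    print(name, results[name])
```

Lower AIC and higher adjusted R² indicate the preferred model; both metrics charge a price for every extra coefficient, which is what makes the batch covariates below a concern.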
After completing QC on the genetic and phenotypic data, I ran into problems with the methylation beta values because of batch effects.
1) The data arrived already normalized with functional normalization (meffil package) and batch-corrected (ComBat function in the sva package). I don't have access to the raw data.
2) The data providers state (and I confirmed with a simple ANOVA) that batch effects are still present even after correction. They suggest including the SLIDE number (the main batch variable) as a covariate in the linear models.
3) Since my aim is to compare models using metrics that penalize the number of parameters (rather than to compare samples for differential methylation), I am worried that including many dummy variables for batch in the models would distort the model comparison.
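The following is a rough sketch of points 2 and 3 on simulated data, not my real dataset. It computes a one-way ANOVA F statistic for slide on one CpG, then fits the same candidate models with and without slide dummies and records the parameter count `k` that enters the AIC penalty. The number of slides (20), the effect sizes, and the helper names `one_way_f` and `aic` are assumptions. One thing the sketch illustrates: if the same slide dummies are included in every candidate model, they add the same amount to each model's `2k` penalty, so the difference in penalty between two models still depends only on the predictors of interest (the likelihood part does of course change).

```python
import numpy as np

rng = np.random.default_rng(1)
n_slides, per_slide = 20, 12  # 12 arrays per 450k slide; 20 slides is an assumption
n = n_slides * per_slide
slide = np.repeat(np.arange(n_slides), per_slide)
slide_shift = rng.normal(0, 1.0, size=n_slides)   # simulated residual batch effect
genotype = rng.integers(0, 3, size=n).astype(float)
age = rng.normal(50, 10, size=n)
y = 0.3 * genotype + slide_shift[slide] + rng.normal(0, 0.5, size=n)

def one_way_f(y, groups):
    """One-way ANOVA F statistic of y across the levels of groups."""
    labels = np.unique(groups)
    grand = y.mean()
    ss_between = sum(np.sum(groups == g) * (y[groups == g].mean() - grand) ** 2
                     for g in labels)
    ss_within = sum(np.sum((y[groups == g] - y[groups == g].mean()) ** 2)
                    for g in labels)
    df_between, df_within = len(labels) - 1, len(y) - len(labels)
    return (ss_between / df_between) / (ss_within / df_within)

def aic(X, y):
    """Gaussian AIC and parameter count k (coefficients plus error variance)."""
    n_obs = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    k = X.shape[1] + 1
    loglik = -0.5 * n_obs * (np.log(2 * np.pi) + np.log(rss / n_obs) + 1)
    return 2 * k - 2 * loglik, k

ones = np.ones(n)
slide_dummies = np.eye(n_slides)[slide][:, 1:]  # first slide is the reference level

designs = {
    "genotype": np.column_stack([ones, genotype]),
    "genotype+age": np.column_stack([ones, genotype, age]),
}
fits = {}
for name, X in designs.items():
    fits[name] = aic(X, y)
    fits[name + "+slide"] = aic(np.column_stack([X, slide_dummies]), y)

f_stat = one_way_f(y, slide)
print("F statistic for slide:", f_stat)
for name, (value, k) in fits.items():
    print(name, "AIC =", value, "k =", k)
```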
The options I could think of so far are:
A) Control for slide in the linear models, as suggested.
B) Rerun the ComBat correction on a different batch variable (such as Plate, for example, which comprises a bunch of slides) and NOT control for batch effects in the linear models.
C) Use surrogate variable analysis to generate a smaller number of batch covariates to control for in the linear models.
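To show what I have in mind for option C, here is a simplified stand-in: principal components of the residuals after regressing out the predictors of interest. This is not the sva algorithm itself, only a PCA-on-residuals sketch of the same idea, and it assumes the batch structure is low-rank (here a single latent factor that is constant within a slide). The data are simulated and the function name `residual_pcs` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_slides, per_slide, n_cpg = 20, 12, 300  # all sizes are assumptions
n = n_slides * per_slide
slide = np.repeat(np.arange(n_slides), per_slide)

# One latent batch factor per sample, constant within each slide
hidden = rng.normal(0, 1.0, size=n_slides)[slide]
genotype = rng.integers(0, 3, size=n).astype(float)
loadings = rng.normal(0, 1.0, size=n_cpg)  # how strongly each CpG responds to batch

# Samples x CpGs matrix: batch signal + noise, plus a genotype effect on 30 CpGs
M = np.outer(hidden, loadings) + rng.normal(0, 0.5, size=(n, n_cpg))
M[:, :30] += 0.3 * genotype[:, None]

def residual_pcs(M, X, n_pcs):
    """Top principal components (per sample) of M after regressing out X."""
    beta, *_ = np.linalg.lstsq(X, M, rcond=None)
    resid = M - X @ beta
    U, s, _ = np.linalg.svd(resid, full_matrices=False)
    return U[:, :n_pcs], s

X = np.column_stack([np.ones(n), genotype])  # predictors of interest to protect
pcs, sing = residual_pcs(M, X, n_pcs=2)

# How well does the first component recover the simulated batch factor?
corr = abs(np.corrcoef(pcs[:, 0], hidden)[0, 1])
print("abs correlation of PC1 with latent batch factor:", corr)
```

In this toy setting one or two such components would replace the 19 slide dummies as covariates; whether that holds for real data depends on how low-dimensional the remaining batch structure actually is.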
I have already tried A and B, and the model comparison results are drastically different between them (B removed the batch effects to the point where I could no longer detect them with a simple ANOVA). I haven't tried C yet. Any hints? Thanks a lot!