Batch correction in Linear Models for methylation data

Hi! I'm working with a large methylation dataset generated on Illumina 450k arrays. I intend to fit a set of linear models on highly variable CpGs, with both phenotypic and genotypic factors as predictors. The models will then be compared using a metric such as AIC or adjusted R2 to identify which predictors best explain my response variable.

After running the full QC on the genetic and phenotypic data, I ran into problems with the methylation betas because of batch effects.

To consider:

1) Data arrived already normalized using functional normalization (meffil package) and batch corrected (ComBat function in the sva package). I don't have access to the raw data.

2) They state (and I confirmed using a simple ANOVA) that batch effects are still present even after correction. They suggest including the SLIDE number (the main batch variable) in the linear models.

3) As I'm aiming to compare models using metrics that penalize the number of parameters (and not compare samples for differential methylation), I am worried that having too many dummy variables for batch correction in the models would affect model comparison too much.
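To make the concern in point 3 concrete, here is a minimal sketch of how the parameter penalty enters AIC and adjusted R2 for an ordinary least-squares model. The function names and all numbers (sample size, number of slides, number of predictors) are hypothetical and chosen only for illustration. The AIC shown is the usual least-squares form, valid up to an additive constant shared by all models fitted to the same samples.

```python
import math


def gaussian_aic(rss, n, n_params):
    """AIC of a least-squares fit, up to a constant shared by all models
    fitted to the same n observations. n_params counts the intercept."""
    return n * math.log(rss / n) + 2 * n_params


def adjusted_r2(r2, n, n_predictors):
    """Adjusted R^2; n_predictors excludes the intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Hypothetical study dimensions (assumptions, not taken from the dataset).
n_samples = 500
n_slides = 60            # a 60-level factor needs 59 dummy variables
base_predictors = 5      # phenotypic + genotypic terms

params_without_slide = base_predictors + 1            # +1 for the intercept
params_with_slide = params_without_slide + (n_slides - 1)

# Extra AIC penalty caused by the slide dummies alone.
extra_penalty = 2 * (params_with_slide - params_without_slide)
print("extra AIC penalty from slide dummies:", extra_penalty)
```

If every candidate model contains the same slide term, this extra 2k penalty is identical in each model and cancels when AIC values are subtracted. The fit term can still change, because slide may absorb variance that would otherwise be attributed to other predictors.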

The options I could think of so far are:

A) Control for slide in the linear models, as suggested.

B) Rerun the ComBat correction on a different batch variable (such as Plate, which comprises several slides) and NOT control for batch effects in the linear models.

C) Use surrogate variable analysis to generate a smaller number of batch covariates to control for in the linear models.

I have already tried A and B, and the results are drastically different when comparing models (B removed the batch effects to the point where I could no longer detect them with a simple ANOVA). I haven't tried C yet. Any hints? Thanks a lot!
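For reference, the kind of "simple ANOVA" check mentioned above can be sketched as a one-way ANOVA of one variable (for example the betas of a single CpG, or the scores of one principal component) grouped by slide. The function name and the toy values are made up for illustration. The code returns only the F statistic and degrees of freedom; a p-value would additionally need an F-distribution routine such as `scipy.stats.f_oneway`.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    n_groups = len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean (batch signal).
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of samples around their own group mean (residual noise).
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = n_groups - 1
    df_within = n_total - n_groups
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy betas for one CpG on three hypothetical slides.
slide_a = [0.50, 0.52, 0.51]
slide_b = [0.60, 0.62, 0.61]
slide_c = [0.55, 0.56, 0.57]

f_stat, df_b, df_w = one_way_anova_f([slide_a, slide_b, slide_c])
print(f"F = {f_stat:.2f} on ({df_b}, {df_w}) degrees of freedom")
```

A large F relative to its degrees of freedom indicates that slide still explains a substantial share of the variance in that variable.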

methylation ComBat Linear models batch correction AIC • 699 views

Ideally, ComBat would never have been run in the first place, but you indicate that you no longer have access to the data pre-ComBat (?).

What I am thinking is this:

  • how does the data appear on a PCA bi-plot for PC1 vs PC2, and how much variation is explained by each PC?
  • how do you define "drastically different" (between A and B)?
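As a rough illustration of the first question, the share of variance carried by PC1 and PC2 can be computed from the eigenvalues of the sample covariance matrix. The sketch below uses plain Python and power iteration on a tiny made-up matrix (rows are samples, columns are features); all names are hypothetical. On real 450k data a dedicated SVD routine such as `prcomp` in R or `numpy.linalg.svd` would be the practical choice.

```python
def covariance(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    p = len(means)
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def mat_vec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def top_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its eigenvector via power iteration."""
    vec = [1.0 / (i + 1) for i in range(len(matrix))]  # arbitrary start
    for _ in range(iterations):
        new = mat_vec(matrix, vec)
        norm = sum(x * x for x in new) ** 0.5
        if norm == 0:
            return 0.0, vec
        vec = [x / norm for x in new]
    # Rayleigh quotient of the converged unit vector.
    value = sum(a * b for a, b in zip(vec, mat_vec(matrix, vec)))
    return value, vec


def variance_explained(rows, n_components=2):
    """Fraction of total variance explained by the leading components."""
    cov = covariance(rows)
    p = len(cov)
    total = sum(cov[i][i] for i in range(p))
    fractions = []
    for _ in range(n_components):
        value, vec = top_eigenpair(cov)
        fractions.append(value / total)
        # Deflate: remove the component just found before the next pass.
        cov = [
            [cov[i][j] - value * vec[i] * vec[j] for j in range(p)]
            for i in range(p)
        ]
    return fractions


# Toy matrix: most variance lies along the first feature.
toy = [[2, 0, 0], [-2, 0, 0], [0, 1, 0], [0, -1, 0]]
pc1, pc2 = variance_explained(toy)
print(f"PC1: {pc1:.1%}, PC2: {pc2:.1%}")
```

Colouring the PC1 vs PC2 scores by slide or plate then shows whether the leading components are dominated by batch.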

For me, the only practical way out of this is via A or C. They should hopefully produce comparable results.


