Question

Batch correction in Linear Models for methylation data

lucasmiranda42 • 0

@lucasmiranda42-22517

Hi! I'm working with a large methylation dataset generated on Illumina 450k arrays. I intend to fit a set of linear models on highly variable CpGs, including both phenotypic and genotypic factors as predictors. The models will then be compared using a metric such as AIC or adjusted R² to identify which predictors best explain my response variable.
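For reference, the two metrics mentioned above have the following standard definitions. The symbols are introduced here for illustration and are not from the original post: k is the number of estimated parameters, \hat{L} the maximized likelihood, n the number of samples, and p the number of predictors excluding the intercept. The second line is simply the difference in AIC between two candidate models.

```latex
\[
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
\]
\[
\mathrm{AIC}_1 - \mathrm{AIC}_2 = 2\,(k_1 - k_2) - 2\left(\ln\hat{L}_1 - \ln\hat{L}_2\right)
\]
```

Every extra dummy variable raises k and p by one. Covariates shared by both models contribute equally to k_1 and k_2, so they drop out of the penalty term of the difference. They still influence each likelihood and, for adjusted R², they shrink the residual degrees of freedom n - p - 1 in every model.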

After running the full QC on the genetic and phenotypic data, I ran into problems with the methylation beta values because of batch effects.

To consider:

1) The data arrived already normalized using functional normalization (meffil package) and batch-corrected (ComBat function in the sva package). I don't have access to the raw data.

2) They state (and I confirmed using a simple ANOVA) that batch effects are still present even after correction. They suggest including the SLIDE number (the main batch variable) in the linear models.

3) As I'm aiming to compare models using metrics that penalize the number of parameters (and not to compare samples for differential methylation), I am worried that having many dummy variables for batch in the models would distort the model comparison.
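The ANOVA check mentioned in 2) could look roughly like the sketch below. It is written in Python with NumPy rather than with the R packages named above, it runs on simulated values rather than the real data, and the function name `one_way_anova_f` and all variable names are hypothetical. It only computes the F statistic for one CpG across slides; a p-value would additionally need the F distribution.

```python
import numpy as np

def one_way_anova_f(values, groups):
    """Return the one-way ANOVA F statistic of `values` across `groups`."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    n, k = values.size, labels.size
    grand_mean = values.mean()
    ss_between = 0.0
    ss_within = 0.0
    for label in labels:
        v = values[groups == label]
        ss_between += v.size * (v.mean() - grand_mean) ** 2
        ss_within += ((v - v.mean()) ** 2).sum()
    # Mean square between groups divided by mean square within groups.
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Simulated example (assumption, not real data): 4 slides x 12 samples each.
rng = np.random.default_rng(0)
slide = np.repeat(np.arange(4), 12)
slide_shift = np.array([0.0, 0.05, -0.05, 0.1])  # hypothetical per-slide offsets

beta_with_batch = 0.5 + slide_shift[slide] + rng.normal(0, 0.01, size=slide.size)
beta_no_batch = 0.5 + rng.normal(0, 0.01, size=slide.size)

f_batch = one_way_anova_f(beta_with_batch, slide)
f_clean = one_way_anova_f(beta_no_batch, slide)
print(f_batch, f_clean)
```

A large F for many CpGs against the slide variable, compared with values near 1 when no slide effect exists, is the kind of signal that suggests residual batch structure.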

The options I could think of so far are:

A) Control for slide in the linear models, as suggested.

B) Rerun the ComBat correction on a different batch variable (such as Plate, which comprises several slides) and NOT control for batch effects in the linear models.

C) Use surrogate variable analysis to generate a smaller number of batch-related variables to control for in the linear models.

I already tried A and B, and the results are drastically different when comparing models (B removed the batch effects to the point where I could no longer detect them with a simple ANOVA). I haven't tried C yet. Any hints? Thanks a lot!

methylation ComBat Linear models batch correction AIC • 966 views

updated 4.4 years ago by Kevin Blighe ★ 3.9k • written 4.4 years ago by lucasmiranda42 • 0

score 0 · Answer 1 · 2019-12-09

Ideally, ComBat would never have been run in the first place, but you indicate that you no longer have access to the data pre-ComBat (?).

What I am thinking is this:

How does the data appear on a PCA bi-plot for PC1 vs PC2, and how much variation is explained by each PC?
How do you define "drastically different" (between A and B)?
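A minimal sketch of how the PC1/PC2 scores and the variance explained by each PC could be obtained is given here. It uses Python with NumPy on simulated data, not the poster's dataset, and the function `pca_scores_and_variance` and the simulated slide offsets are assumptions made for illustration.

```python
import numpy as np

def pca_scores_and_variance(matrix, n_components=2):
    """PCA via SVD on a samples x features matrix.

    Returns the sample scores on the first components and the
    fraction of total variance explained by each of them.
    """
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    scores = u[:, :n_components] * s[:n_components]
    return scores, explained[:n_components]

# Simulated data (assumption): 48 samples on 4 slides, 200 CpGs,
# with a hypothetical per-slide offset added to every CpG.
rng = np.random.default_rng(2)
slide = np.repeat(np.arange(4), 12)
batch_shift = rng.normal(0, 1.0, size=(4, 200))[slide]
data = batch_shift + rng.normal(0, 0.3, size=(48, 200))

scores, explained = pca_scores_and_variance(data)
print(explained)  # fraction of variance for PC1 and PC2
```

Plotting the two columns of `scores` against each other, coloured by slide or plate, would give the bi-plot; samples clustering by slide would indicate that batch still drives the leading components.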

For me, the only practical way out of this is via A or C. They should hopefully produce comparable results.
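As a rough illustration of option A, the sketch below includes the same slide dummy variables in every candidate model and compares the models by AIC. It is Python with NumPy on simulated data, the helper names `dummy_code` and `gaussian_aic` are hypothetical, and the Gaussian log-likelihood of an ordinary least squares fit is assumed.

```python
import numpy as np

def dummy_code(labels):
    """One-hot encode a factor, dropping the first level as the reference."""
    labels = np.asarray(labels)
    levels = np.unique(labels)
    return (labels[:, None] == levels[None, 1:]).astype(float)

def gaussian_aic(y, X):
    """AIC of an OLS fit of y on X (an intercept column is added here)."""
    n = y.size
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(((y - design @ coef) ** 2).sum())
    k = design.shape[1] + 1  # regression coefficients plus the error variance
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return 2 * k - 2 * log_lik

# Simulated data (assumption): 8 slides x 12 samples; only the
# phenotype truly affects the response, the genotype does not.
rng = np.random.default_rng(1)
slide = np.repeat(np.arange(8), 12)
phenotype = rng.normal(size=slide.size)
genotype = rng.integers(0, 3, size=slide.size).astype(float)
slide_effect = rng.normal(0, 1.0, size=8)[slide]
y = 0.8 * phenotype + slide_effect + rng.normal(0, 0.5, size=slide.size)

slide_dummies = dummy_code(slide)  # 7 columns for 8 slides
aic_pheno = gaussian_aic(y, np.column_stack([phenotype, slide_dummies]))
aic_geno = gaussian_aic(y, np.column_stack([genotype, slide_dummies]))
print(aic_pheno, aic_geno)
```

Because both models carry the same seven slide columns, the parameter penalty from those columns is identical in each AIC, and the comparison is driven by how well the remaining predictor fits.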

Kevin