Question

Batch effect correction: methylome data

Seungmin • 0

Hi,

I recently analyzed two EPIC v1 (850K) datasets and two EPIC v2 (900K) datasets. Because the EPIC v1 and v2 datasets have different IDAT file sizes and types, I processed them separately using the SeSAMe package.

After preprocessing, I plotted the overlapping probes in a PCA plot and saw a clear separation between the EPIC v1 and v2 datasets. When I then applied batch effect correction (using ComBat from the sva package), some beta values became negative or greater than 1.

In this case, should I just proceed with the analysis as is, or is there another correction method?

Thank you.

sesame BatchEffect MethylationArray • 2.4k views

updated 20 months ago by zhouwanding • written 20 months ago by Seungmin

score 0 · Answer 1 · 2024-05-14

Unless you are using DSS, you shouldn't be using beta values; you should be using M-values instead. It's not clear to me that you can simply combine EPIC v1 and v2 arrays, and in general I would default to fitting batch as a term in the model rather than trying to remove the batch effect with ComBat (even if you do use ComBat first, you should still include batch in the model so that the available degrees of freedom are reduced correctly). However!

I normally think of any high-throughput measurement as being somewhat correlated with the underlying thing being measured (in this case methylation) rather than as an actual measure of it. What you are really measuring is the fluorescence intensity of probes hybridized to bisulfite-converted DNA, on the assumption that the signal will be stronger or weaker depending on the methylation status of the CpG site. We assume that this behavior will be consistent within a given experiment, but there is no reason, IMO, to assume that it will be consistent across different batches of reagents, different chips, etc. In other words, I think it's entirely expected that you get odd results if you try to batch-correct different array types using ComBat.
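To make the M-value point concrete, here is a minimal sketch (plain Python rather than R, purely as an illustration) of the usual beta/M transform, M = log2(beta / (1 - beta)). The clamp `eps` and the additive `shift` are made-up values standing in for a batch adjustment, not anything ComBat computes. The point is only that an adjustment made on the M scale maps back to a beta between 0 and 1, whereas the same kind of shift on the beta scale can leave that range.

```python
import math

def beta_to_m(beta, eps=1e-6):
    """Convert a beta value to an M-value, M = log2(beta / (1 - beta))."""
    # Clamp away from 0 and 1 so the log stays finite; eps is an arbitrary choice.
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))

def m_to_beta(m):
    """Inverse transform, beta = 2^M / (2^M + 1)."""
    return 1.0 / (1.0 + 2.0 ** (-m))

betas = [0.02, 0.50, 0.97]
m_values = [beta_to_m(b) for b in betas]

# Hypothetical additive batch shift, applied on the M scale.
shift = -3.0
corrected_betas = [m_to_beta(m + shift) for m in m_values]

# The same idea applied directly on the beta scale can leave [0, 1].
naive_betas = [b - 0.10 for b in betas]

print(corrected_betas)
print(naive_betas)
```

In practice you would run the correction or the model on the M-value matrix and convert back to beta only for reporting or plotting.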

Given that probe behavior may not be consistent across array types, it is probably more conservative to analyze the two datasets separately and then combine the statistics in a meta-analysis, rather than pooling everything and expecting the results to be meaningful. You might consider the GeneMeta package, though I imagine other packages would be useful in that context as well.
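As a rough sketch of what "combine the statistics" can mean for a single probe, here is an inverse-variance fixed-effect meta-analysis in plain Python. This is not GeneMeta's API, and the effect sizes and standard errors are invented numbers standing in for per-array-type results (for example, the group coefficient from separate EPIC v1 and EPIC v2 fits). A random-effects model may be more appropriate if the two platforms are expected to differ.

```python
import math

def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance weighted pooling of per-study effect estimates."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = pooled / pooled_se
    # Two-sided p-value under a normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, z, p_value

# Hypothetical per-platform results for one probe (M-value differences).
estimates = [0.40, 0.60]    # EPIC v1 analysis, EPIC v2 analysis
std_errors = [0.10, 0.20]

pooled, pooled_se, z, p_value = fixed_effect_meta(estimates, std_errors)
print(pooled, pooled_se, z, p_value)
```

The more precise study (smaller standard error) dominates the pooled estimate, which is the intended behavior of this weighting.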

score 0 · Answer 2 · 2024-05-14

I agree that batch is better modeled as a covariate than removed by explicit batch correction. There is also RUVm (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4652745/), which removes batch effects for PCA.
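For readers unsure what "batch as a covariate" looks like, here is a small self-contained sketch in plain Python: an ordinary least-squares fit for one probe with an intercept, a batch indicator, and a group indicator in the design matrix. The sample values and the 0/1 coding are invented for illustration, and a real analysis would use an established linear-model package across all probes rather than this hand-rolled solver.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_ols(design, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical M-values for one probe across 8 samples.
batch = [0, 0, 0, 0, 1, 1, 1, 1]    # 0 = EPIC v1, 1 = EPIC v2
group = [0, 0, 1, 1, 0, 0, 1, 1]    # 0 = control, 1 = case
m_values = [1.0, 1.2, 1.6, 1.8, 3.0, 3.2, 3.6, 3.8]

# Design matrix columns: intercept, batch, group.
design = [[1.0, float(b), float(g)] for b, g in zip(batch, group)]
intercept, batch_effect, group_effect = fit_ols(design, m_values)

# Residual degrees of freedom: one is spent on the batch term.
residual_df = len(m_values) - len(design[0])
print(intercept, batch_effect, group_effect, residual_df)
```

The group coefficient is the quantity of interest; the batch coefficient absorbs the platform offset, and the residual degrees of freedom reflect that a parameter was spent estimating it.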

We have also curated a list of EPIC-EPICv2-consistent probes based on empirical cell line data, which might be useful to you (sesameDataGet("liftOver.EPICv2ToEPIC") and https://github.com/zhou-lab/InfiniumAnnotationV1/blob/main/Anno/EPICv2/EPICv2ToEPIC_conversion.tsv.gz). You can also use them with the mLiftOver function in the latest sesame (https://www.biorxiv.org/content/10.1101/2024.03.18.585415v1).