Question

Two EPIC Methylation Array Analysis Questions

phelankj • 0

Hello,

I have two questions about EPIC methylation array data analysis. I am using the minfi package: I have removed low-quality samples, processed the data with quantile normalization, and extracted both beta and M-values.

When looking at MDS (or PCA) plots for sources of variation in the data, I see a large batch effect along PC2 corresponding to the plate on which the sample was run, which I expected. However, there is a large unexplained difference between samples along the first principal component, and it is present in both plates. It does not correlate with position on the chip, and it is not associated with age, sex, race, or biological condition. It accounts for a huge amount of the variance (79%), so I was wondering whether there are any common technical issues that people adjust for up front, or whether anyone has seen this kind of discrepancy before.

Is linear regression (as in the limma package) an appropriate model for identifying differentially methylated CpGs/regions? The beta and M-values have bimodal distributions, which violates the normality assumption, so I wasn't sure whether another transformation is needed to achieve a normal distribution.

Thank you in advance for the help.

minfiDataEPIC minfi • 782 views

updated 19 months ago by James W. MacDonald 68k • written 19 months ago by phelankj • 0

score 0 · Answer 1 · 2023-08-28

It's common to use limma to compare samples using the M-values. The distribution you are talking about is the between-CpG distribution within an array, which is orthogonal to the distribution you care about. In other words, consider that your data have CpGs in rows and samples in columns. The bimodal distribution is what you get when you plot the values in a column (all CpGs for one sample). But the comparisons you will be making are within rows (i.e., you compare the same CpG in different samples, not different CpGs in the same sample).
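To make the rows-versus-columns point concrete, here is a small simulation sketch. It is not minfi or limma code: it uses Python with numpy only, and every number in it (CpG count, sample count, the two methylation modes, the noise level) is a made-up assumption chosen for illustration. The only fixed relationship is the usual definition of the M-value as the log2 ratio of beta to (1 - beta).

```python
import numpy as np

rng = np.random.default_rng(0)
n_cpgs, n_samples = 2000, 12  # hypothetical sizes, far smaller than a real EPIC array

# Assumed per-CpG baseline on the M-value scale: half the CpGs are mostly
# unmethylated (around -3) and half are mostly methylated (around +3).
baseline_m = np.concatenate([
    rng.normal(-3.0, 0.5, n_cpgs // 2),
    rng.normal(3.0, 0.5, n_cpgs // 2),
])

# Each sample is the baseline plus small, roughly normal sample-to-sample noise.
m_values = baseline_m[:, None] + rng.normal(0.0, 0.3, (n_cpgs, n_samples))

# Beta and M-values are related by M = log2(beta / (1 - beta)).
beta_values = 2.0 ** m_values / (2.0 ** m_values + 1.0)
m_roundtrip = np.log2(beta_values / (1.0 - beta_values))

# Column view: all CpGs within one sample. The spread is dominated by the
# gap between the two modes, which is the bimodality seen in density plots.
column_sd = m_values[:, 0].std()
fraction_in_gap = np.mean(np.abs(m_values[:, 0]) < 1.0)

# Row view: one CpG across samples. This is what a per-CpG linear model
# sees, and here it is just the small within-CpG noise.
mean_row_sd = m_values.std(axis=1).mean()

print(f"SD within one sample (column): {column_sd:.2f}")
print(f"Fraction of that column with |M| < 1: {fraction_in_gap:.4f}")
print(f"Mean SD within one CpG (row):  {mean_row_sd:.2f}")
```

In this toy setup the column is strongly bimodal while each row varies only a little around its own mean, so the shape of the per-sample density says nothing about whether the per-CpG values meet the model's assumptions.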

As for the PCA, you don't have a large effect due to batch: it's only 7%. But you have a massive effect (almost 80% of the variation!) that you cannot explain? That's suboptimal. But then I don't use quantile normalization. You might try preprocessFunnorm, which is my personal go-to for methylation arrays.
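For reference, the percentages quoted for each principal component, and the check of whether a component tracks a known covariate such as plate, can be sketched as follows. This is again a numpy-only simulation rather than minfi code. The `plate` and `hidden` vectors and all effect sizes are hypothetical, set up so that an unrecorded factor dominates and plate is secondary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cpgs, n_samples = 1000, 16  # hypothetical sizes

plate = np.repeat([0, 1], n_samples // 2)   # known batch label (assumed)
hidden = np.tile([0, 1], n_samples // 2)    # factor nobody recorded (assumed)

# Simulated M-values: noise, plus a strong hidden effect, plus a weaker plate effect.
m_values = rng.normal(0.0, 0.2, (n_cpgs, n_samples))
m_values += np.outer(rng.normal(0.0, 1.0, n_cpgs), hidden)
m_values += np.outer(rng.normal(0.0, 0.3, n_cpgs), plate)

# PCA on samples: centre each CpG across samples, then take the SVD.
centered = m_values - m_values.mean(axis=1, keepdims=True)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Proportion of total variance carried by each component.
var_explained = s ** 2 / np.sum(s ** 2)

# Sample scores on each component (rows are samples, columns are PCs).
pc_scores = vt.T * s


def abs_corr(x, y):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(x, y)[0, 1])


pc1_vs_plate = abs_corr(pc_scores[:, 0], plate)
pc2_vs_plate = abs_corr(pc_scores[:, 1], plate)
pc1_vs_hidden = abs_corr(pc_scores[:, 0], hidden)

print("Variance explained by PC1, PC2:", np.round(var_explained[:2], 3))
print("|corr(PC1, plate)|: ", round(pc1_vs_plate, 3))
print("|corr(PC2, plate)|: ", round(pc2_vs_plate, 3))
print("|corr(PC1, hidden)|:", round(pc1_vs_hidden, 3))
```

Running the same correlation check against every recorded covariate is one way to confirm that a dominant component really is unexplained by the sample annotation before and after changing the normalization.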