Seeking Strategies for Managing Duplicate CpG Entries in EPIC v2 Manifest
@31eec9dd

Hello everyone,

I've encountered an issue with the EPIC v2 manifest: roughly 12,256 CpG entries are duplicated in the "Name" column. For instance, cg00002033 appears more than once, although the corresponding IlmnID values are distinct (e.g., cg00002033_TC11 vs. cg00002033_TC12). Despite an extensive literature search, I could not find any previous discussion of, or solution to, this problem. The duplication makes it difficult to identify differentially methylated positions (DMPs) accurately.
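To make the duplication concrete, here is a minimal sketch of how the replicates could be listed. It assumes that the "Name" value is the part of the IlmnID before the first underscore, as in the cg00002033 example; the third ID is made up for illustration, and a real run would read the IlmnID column from the manifest file instead of a hard-coded list.

```python
from collections import defaultdict

# Illustrative IlmnID values. The first two come from the example above;
# the third is a made-up non-duplicated probe.
ilmn_ids = ["cg00002033_TC11", "cg00002033_TC12", "cg00000029_TC21"]


def group_by_name(ids):
    """Group IlmnIDs by their CpG name (assumed: the text before the first '_')."""
    groups = defaultdict(list)
    for ilmn_id in ids:
        name = ilmn_id.split("_", 1)[0]
        groups[name].append(ilmn_id)
    return dict(groups)


groups = group_by_name(ilmn_ids)

# Keep only the names that are measured by more than one probe.
duplicated = {name: ids for name, ids in groups.items() if len(ids) > 1}
print(duplicated)
```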

I wonder if these duplications serve a purpose, such as enhancing technical reliability. I am reaching out to the community for advice or suggestions on how to effectively address this situation. Any insights or guidance would be greatly valued.

Thank you for your help.

DNAMethylation EPICv2 IlluminaHumanMethylationEPICmanifest • 390 views
@james-w-macdonald-5106

Read the manifest release notes that come with the manifest file.

Thank you James for sharing this file.

I checked the beta values for the two replicates, TC11 and TC12; they differ by 0.15. For DMP analysis, which one should I use, given that both represent the same CpG? I looked through the literature but did not find anything. I am thinking of taking the average, but is there a technique or common practice, described in a paper, for handling probes that share a Name but differ in IlmnID?

I usually don't bother worrying about such things. Since the beginning of the microarray era there have always been probes that may or may not measure the same thing. Affymetrix arrays had multiple probesets that might have measured the same gene (or not, who knows), and now the EPIC v2 has probes that measure either strand (from different directions) and there are duplicates for some as well, which presumably are longer or shorter or whatever.

In terms of DMPs, why do you have to choose one to consider? Let's say there are four that measure a given CpG and one has a p<1e-8 and the others have larger p-values. You could either compute the mean of the methylation values and likely destroy any possible signal, or select the probe with the lowest p-value (both choices are part of the sesame package btw, see ?getBetas for that package). If you ignore the duplication, you will end up selecting any probe with a p-value less than your cutoff, and maybe there will be duplicates.
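The trade-off between the two choices can be sketched with a toy example. Everything here is an assumption for illustration: the beta values and p-values are made up, the group difference is a plain difference of means rather than a proper statistical test, and this is plain Python, not the sesame implementation.

```python
from statistics import mean

# Made-up per-sample beta values for two replicate probes of one CpG.
# The first replicate separates the groups; the second does not.
betas = {
    "cg00002033_TC11": {"control": [0.30, 0.32, 0.28], "case": [0.50, 0.52, 0.48]},
    "cg00002033_TC12": {"control": [0.40, 0.45, 0.35], "case": [0.41, 0.44, 0.35]},
}


def group_delta(values):
    """Difference in mean beta between case and control samples."""
    return mean(values["case"]) - mean(values["control"])


# Effect seen by each replicate probe on its own.
per_probe_delta = {probe: group_delta(v) for probe, v in betas.items()}

# Choice 1: average the replicates sample by sample, then compare groups.
averaged = {
    group: [mean(sample) for sample in zip(*(betas[p][group] for p in betas))]
    for group in ("control", "case")
}
averaged_delta = group_delta(averaged)

# Choice 2: keep the replicate with the smallest p-value.
# These p-values are hypothetical stand-ins for per-probe test results.
p_values = {"cg00002033_TC11": 1e-8, "cg00002033_TC12": 0.6}
best_probe = min(p_values, key=p_values.get)

print(per_probe_delta, averaged_delta, best_probe)
```

In this toy data the averaged difference is half of what the better replicate shows on its own, which is the dilution described above; picking the smallest p-value keeps the stronger probe but is a selection made after seeing the results.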

But in both cases (taking the mean, or choosing the CpG with the smallest p) you are choosing, in bulk, what should happen with the N duplicated probes for each of the 12K probes with duplicates. I have no idea what is the 'right thing' to do in a given situation, let alone 12K situations that I haven't explored in any depth. I do the same thing I always did with older microarrays, which is to ignore the duplication and just go forward. If you end up with a bunch of duplicate probes for a given CpG that are all significant, that might mean the signal is better in some sense, but I don't see a particular problem in that situation.
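For what ignoring the duplication might look like in practice, here is a rough sketch: every probe is filtered against the cutoff independently, and the duplicates are only counted afterwards. The p-values, the cutoff, and the third probe ID are made up, and the CpG name is assumed to be the part of the IlmnID before the first underscore.

```python
from collections import Counter

# Hypothetical per-probe results as (IlmnID, p-value) pairs.
results = [
    ("cg00002033_TC11", 1e-9),
    ("cg00002033_TC12", 2e-8),
    ("cg00000029_TC21", 0.2),
]
cutoff = 1e-7  # illustrative significance cutoff

# Treat every probe independently: no averaging, no picking a winner.
significant = [ilmn_id for ilmn_id, p in results if p < cutoff]

# Afterwards, count how many significant probes map to each CpG name.
hits_per_cpg = Counter(ilmn_id.split("_", 1)[0] for ilmn_id in significant)
replicated_hits = {name: n for name, n in hits_per_cpg.items() if n > 1}

print(significant)
print(replicated_hits)
```

A CpG that shows up in `replicated_hits` simply has more than one replicate passing the cutoff, which can be reported as is.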

Here's an evaluation paper, where they talk a bit about what to do with the duplicates.
