Hello everyone,
I've encountered an issue with the EPIC v2 manifest: roughly 12,256 CpG entries are duplicated in the column labeled "Name". For instance, cg00002033 appears more than once. However, the corresponding IlmnID values are distinct (e.g., cg00002033_TC11 vs. cg00002033_TC12). Despite an extensive literature search, I was unable to find any previous discussion of, or solution to, this problem. The duplication makes it hard to identify differentially methylated positions (DMPs) accurately.
I wonder if these duplications serve a purpose, such as enhancing technical reliability. I am reaching out to the community for advice or suggestions on how to effectively address this situation. Any insights or guidance would be greatly valued.
Thank you for your help.
Thank you, James, for sharing this file.
I checked the beta values for the two replicates, TC11 and TC12; they differ by 0.15. In terms of DMPs, which one should I consider, since both measure the same CpG? I looked into the literature but did not find anything. I am thinking of taking the average, but is there any technique or common practice, described in a paper, for dealing with duplicated IlmnIDs?
I usually don't bother worrying about such things. Since the beginning of the microarray era there have always been probes that may or may not measure the same thing. Affymetrix arrays had multiple probesets that might have measured the same gene (or not, who knows), and now the EPIC v2 has probes that measure either strand (from different directions) and there are duplicates for some as well, which presumably are longer or shorter or whatever.
In terms of DMPs, why do you have to choose one to consider? Let's say there are four that measure a given CpG and one has a p<1e-8 and the others have larger p-values. You could either compute the mean of the methylation values and likely destroy any possible signal, or select the probe with the lowest p-value (both choices are part of the sesame package btw, see ?getBetas for that package). If you ignore the duplication, you will end up selecting any probe with a p-value less than your cutoff, and maybe there will be duplicates. But in both cases (taking the mean, or choosing the CpG with the smallest p) you are choosing, in bulk, what should happen with the N duplicated probes for each of the 12K probes with duplicates. I have no idea what is the 'right thing' to do in a given situation, let alone 12K situations that I haven't explored in any depth. I do the same thing I always did with older microarrays, which is to ignore the duplication and just go forward. If you end up with a bunch of duplicate probes for a given CpG that are all significant, that might mean the signal is better in some sense, but I don't see a particular problem in that situation.
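To make the two bulk choices concrete, here is a rough sketch in plain Python (not the sesame implementation, and not a recommendation for either option). It assumes the CpG name is whatever precedes the first underscore in the IlmnID, as in cg00002033_TC11, and that you already have per-probe beta values and some per-probe p-value. The function names, the beta values, the p-values, and the second probe ID are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def cpg_name(ilmn_id):
    """Strip the replicate suffix: 'cg00002033_TC11' -> 'cg00002033'."""
    return ilmn_id.split("_", 1)[0]


def group_by_cpg(ilmn_ids):
    """Map each CpG name to the list of IlmnIDs that share it."""
    groups = defaultdict(list)
    for probe_id in ilmn_ids:
        groups[cpg_name(probe_id)].append(probe_id)
    return dict(groups)


def collapse_mean(betas):
    """Option 1: average the replicate probes, sample by sample.

    betas maps IlmnID -> list of beta values (one per sample).
    Returns CpG name -> list of mean beta values (one per sample).
    """
    collapsed = {}
    for name, probe_ids in group_by_cpg(betas).items():
        per_sample = zip(*(betas[p] for p in probe_ids))
        collapsed[name] = [mean(values) for values in per_sample]
    return collapsed


def collapse_min_p(betas, pvals):
    """Option 2: keep only the replicate with the smallest p-value.

    Returns CpG name -> (chosen IlmnID, its beta values).
    """
    collapsed = {}
    for name, probe_ids in group_by_cpg(betas).items():
        best = min(probe_ids, key=lambda p: pvals[p])
        collapsed[name] = (best, betas[best])
    return collapsed


# Hypothetical data: two replicates of one CpG plus one unduplicated probe.
example_betas = {
    "cg00002033_TC11": [0.25, 0.75],
    "cg00002033_TC12": [0.50, 0.75],
    "cg99999999_TC21": [0.50, 0.50],  # made-up probe ID
}
example_pvals = {
    "cg00002033_TC11": 1e-9,
    "cg00002033_TC12": 0.03,
    "cg99999999_TC21": 0.5,
}

mean_betas = collapse_mean(example_betas)
best_probes = collapse_min_p(example_betas, example_pvals)
```

The third option, ignoring the duplication, needs no code at all: you keep the IlmnID as the row identifier and report whichever probes pass your cutoff.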
Here's an evaluation paper, where they talk a bit about what to do with the duplicates.