Differential methylation probes were much fewer than differential methylation genes
1
0
Entering edit mode
@yuabrahamliu-17670
Last seen 3 months ago

Hi everyone,

I have a question on differential DNA methylation probes and genes. Hope somebody could give some guidacne. I am working on an EPIC DNA methylation dataset with more than 200 samples (half control samples, and half disease samples), with other confounding information included, like various cell contents (cell type1 to cell type3) of the samples, and other factors like donator's age, gender, BMI and so on. Then I did differential methylation analysis on two levels, one was on EPIC methylation probe level (use their beta values), and the other was on gene level (use the average beta value of probes in gene promotor region to represent gene methylation level). For both probes and genes, I used limma to find differential ones with an regression model like,

probe/gene beta value ~ Disease_status + Cell_type1_content + Cell_type2_content + Cell_type3_content + Disease_status:Cell_type1_content + Diease_status:Cell_type2_content + Diease_status:Cell_type3_content + age + gender + BMI + ... (other confounding factors)


Then, I checked the adjusted p-value of the interaction terms like, Diesease_status:Cell_type1_content. It inidcated whether the correlation between probe/gene and Cell_type1_content, was significantly different between control group and disease group. However, what I found wired was that, on gene level, more than 3000 such differential genes could be found for the interaction term Disease_status:Cell_type1_content, while on probe level, only less than 500 probes could be found, which was much fewer than the genes. Since gene methylation value was summaried from probe value, as mentioned above, it was unreasonable that its number was much larger than that of probe. I don't know whether it is the problem of my limma regression design, or gene beta value summarization. Hope somebody could give some suggestions. Thank you so much!

methylation limma • 224 views
0
Entering edit mode
@james-w-macdonald-5106
Last seen 2 hours ago
United States

You shouldn't use beta values with software that expects 'normal-ish' data. Given the number of samples you could assume the CLT has long since kicked in, but it's just as easy to use M-values, which don't have the issues that the beta values have.

As for your main question, when you take the average of a bunch of probes (which doesn't really make sense for beta values, but I digress) you are artificially reducing the variance of your observations, which may well cause you to have more 'significant' results. In other words, the average of the probes will be much less variable than the individual probes themselves. Since the denominator of your statistic is a measure of that variance, the averaged probes are much more likely to be significant. That's not to say the results are correct, but just an observation that taking averages like that will tend to be anti-conservative.

It could also be considered cheating, and is why none of the methods for aggregating signal across genomically adjacent regions do that. You might consider using published methods like DMRcate or minfi (although with that number of samples, minfi seems prohibitively slow).

1
Entering edit mode

Thanks for answering a limma question. I agree with most of your post, but I don't see any reason to think that taking averages would be anti-conservative or would be at all equivalent to cheating. I agree that averages will be less noisy, but that is a good thing. IMO an analysis of averages for gene promoters should control the error rate at least as well as published methods for aggregating adjacent regions implemented in packages such as DMRcate or minfi. I agree that M-values would probably be better than beta-values, but I have little experience with methylation arrays.

In general, it is always valid to compute summary statistics (such as an averages) and to conduct a statistical analysis of the averages. The only requirement is that the values to be averaged should be chosen purely according to gene annotation and not according to the data or the size of the beta or M-values. Ideally, the averages should be taken over similar numbers of individual observations, but statistical analyses are usually fairly robust to variations in n.

The reason why DMRcate, minfi etc don't take genewise averages is that they want to conduct analyses that are annotation agnostic, and they want to detect methylation changes anywhere in the genome, not just within promoters. It is a question of what scientific question one wants to answer rather than a question of statistical validity. The approach they take is far more delicate statistically because they are in fact choosing regions to be aggregated on the basis of the observed M-values and therefore they have to adjust for multiple testing at the observation level as well at the region level. Aaron Lun and I explored similar issues in the csaw package for ChIP-seq. These issues don't arise for gene promoter analyses or for any analysis where the genomic intervals are preset.

I am a big fan of gene-orientated analyses that focus on the main genewise signal rather than trying to detect every possible change anywhere in the genome in a single analysis, for example:

0
Entering edit mode