Question

Significant DMRs in genes with no single significant CpG using limma

anne-kristin.stavrum • 0

@anne-kristinstavrum-13056

I have analysed my data (from the Illumina EPIC array), using limma to find single differentially methylated CpGs (DMPs) and DMRcate to find DMRs. The data and the model are exactly the same for both analyses: no intercept, with a contrast of interest specified.

When I compare the lists of genes that the significant DMRs and DMPs map to, there is of course an overlap, but more than half of the genes with a significant DMR do not show up on the list I get using limma. When I check all CpGs mapping to these genes (not just the ones in the DMR), then none of them show up on the list of significant DMPs. I understand that due to the smoothing function, some DMRs may stretch into the promoter region of a gene, hence it is possible to get genes on the list of DMRs that do not appear on the list from limma, but I have examples where the DMR is the exon of a gene. In this case I would expect that at least one of the CpGs would show up on the list of significant DMPs from limma, since I thought at least one CpG within the DMR would have to be individually significant. Is this not the case?

My question is therefore why this happens? Does the smoothing function of DMRcate give me lots of false positives?

Kind regards, Anne-Kristin

DMRcate • 3.0k views

Written 6.4 years ago by anne-kristin.stavrum

score 0 · Answer 1 · 2019-08-17

Hi Anne-Kristin,

In this case I would expect that at least one of the CpGs would show up on the list of significant DMPs from limma, since I thought at least one CpG within the DMR would have to be individually significant. Is this not the case?

No, this is not necessarily the case. DMRcate does not define DMRs from the DMPs themselves. The only link is that the threshold used to call DMRs is calibrated to the number of DMPs found at whatever FDR you specify in cpg.annotate(). Depending on how the limma t-statistics are spatially distributed, it is very likely that you will get at least some DMRs that contain no DMPs, and some DMPs that are not part of any DMR.

My question is therefore why this happens?

DMRcate considers all CpGs when smoothing, not just the DMPs. So a contiguous critical mass of CpGs, each with a modest effect that nevertheless falls just below the DMP FDR threshold, will be aggregated into a region that is more significant than, say, a group of CpGs where only 1 or 2 are significant and the rest not at all. In fact, the former type of DMR will be reported at the expense of the latter.
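To make the intuition concrete, here is a toy sketch in Python (not R, and not DMRcate's actual algorithm). It assumes made-up z-scores, a hypothetical background of 10,000 null CpGs with evenly spaced p-values, and uses Stouffer's method as a simple stand-in for aggregating evidence across neighbouring CpGs. All names and numbers are illustrative assumptions.

```python
from math import erfc, sqrt

def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return erfc(abs(z) / sqrt(2.0))

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: True for each rejected hypothesis."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k_max = rank  # largest rank that satisfies the BH condition
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

def stouffer_z(zs):
    """Stouffer combined z-score (assumes independent tests)."""
    return sum(zs) / sqrt(len(zs))

# Hypothetical background: 10,000 null CpGs with evenly spaced p-values.
n_null = 10_000
null_p = [i / (n_null + 1) for i in range(1, n_null + 1)]

# Region A: eight neighbouring CpGs, each with a modest effect.
region_a = [3.0] * 8
# Region B: one strong CpG surrounded by seven with no effect at all.
region_b = [5.0] + [0.0] * 7

all_p = (null_p
         + [two_sided_p(z) for z in region_a]
         + [two_sided_p(z) for z in region_b])
rejected = bh_reject(all_p, alpha=0.05)

# Per-CpG calls (the "DMP" view) for each region.
dmp_a = rejected[n_null:n_null + 8]
dmp_b = rejected[n_null + 8:]

# Region-level evidence (the aggregated view).
p_region_a = two_sided_p(stouffer_z(region_a))
p_region_b = two_sided_p(stouffer_z(region_b))

print("DMPs in region A:", sum(dmp_a), " combined p:", p_region_a)
print("DMPs in region B:", sum(dmp_b), " combined p:", p_region_b)
```

In this toy setup none of the eight CpGs in region A survives the per-CpG correction, yet its combined p-value is many orders of magnitude smaller than that of region B, which contains the only individually significant CpG.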

Does the smoothing function of DMRcate give me lots of false positives?

Great question, and this is the point at which the user has to make a judgement call. The post-smoothing per-CpG FDRs (the minimum of which is reported as minfdr in your results GRanges object) are much more permissive than those from limma. So rather than setting a static default threshold on them, the recommended default (pcutoff = "fdr" in dmrcate()) adjusts the final threshold dynamically so that the number of constituent CpGs equals the number of DMPs found by limma at that FDR (as I alluded to in my first paragraph). In effect, this gives two equally sized lists of CpGs that are (most likely) not identical. DMRs are then aggregated from the post-smoothing list, not from the DMPs.

This is an inherently conservative approach, since all CpGs are assumed to be independent (even though we know they are not), so false positives shouldn't be a concern if you are using the default settings. However, if you feel this is too conservative, you can relax the fdr argument in cpg.annotate() to your liking. To offset the extra permissiveness, you could then use the Stouffer value for each DMR to further refine your list, but again this is up to you.
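The size-matching idea behind pcutoff = "fdr" can be sketched as follows. This is a minimal Python illustration of the description above, not DMRcate's source code. The function name matched_size_cutoff and both FDR vectors are hypothetical, and the values are invented for the example.

```python
def matched_size_cutoff(limma_fdr, smoothed_fdr, fdr=0.05):
    """Pick a cutoff on the post-smoothing FDRs so that as many CpGs
    pass it as there are DMPs at the given limma FDR.

    Returns (cutoff, dmp_indices, selected_indices). With ties at the
    cutoff, slightly more CpGs than DMPs could be selected.
    """
    dmps = {i for i, q in enumerate(limma_fdr) if q < fdr}
    if not dmps:
        return None, set(), set()
    # The n-th smallest post-smoothing FDR, where n = number of DMPs.
    cutoff = sorted(smoothed_fdr)[len(dmps) - 1]
    selected = {i for i, q in enumerate(smoothed_fdr) if q <= cutoff}
    return cutoff, dmps, selected

# Invented per-CpG FDRs for eight CpGs (index = CpG position).
limma_fdr    = [0.01, 0.20, 0.03, 0.08, 0.07, 0.09, 0.60, 0.04]
smoothed_fdr = [0.02, 0.30, 1e-6, 1e-5, 1e-7, 1e-3, 0.50, 0.01]

cutoff, dmps, selected = matched_size_cutoff(limma_fdr, smoothed_fdr)

print("cutoff on smoothed FDRs:", cutoff)
print("limma DMPs:             ", sorted(dmps))
print("post-smoothing CpGs:    ", sorted(selected))
print("in both lists:          ", sorted(dmps & selected))
```

The two lists have the same length but share only some members, which is why a DMR built from the post-smoothing list need not contain any limma DMP.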

Best, Tim