
Hi,

I'm working on a differential methylation analysis on WGBS data, using the bsseq package.

I'm following the vignette to identify differentially methylated regions (DMRs) between two groups, but I'm having trouble understanding this step:

> Once t-statistics have been computed, we can compute differentially methylated regions (DMRs) by thresholding the t-statistics. Here we use a cutoff of 4.6, which was chosen by looking at the quantiles of the t-statistics (for the entire genome).
>
> `dmrs0 <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6))`

So, my questions are:

- How was the 4.6 cutoff for the t-statistic chosen, and what does it mean to look at the quantiles of the t-statistic? Can you provide some code that shows how the choice was made?

- Would it be possible to filter by p-value instead? I would find it more intuitive but I don't see a p-value column in the results object. How would I go about adding a p-value column based on the values of the t-statistic?

Thank you!

Enrico

written 3.7 years ago by enricoferrero


The 4.6 cutoff was picked based on a quantile of the cancer dataset (using all autosomes). I have since used the 4.6 cutoff routinely for other systems. The problem I have with a quantile cutoff is that you're implicitly assuming that a certain percentage of the genome is differentially methylated, which is often an open question. In my experience 4.6 gives you good-looking DMRs, and decreasing this cutoff substantially just gives a lot of bad-looking regions. In systems with few differences I don't find many DMRs with a 4.6 cutoff, but that is because there is not much difference between the two groups, not because the cutoff is too stringent.
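To illustrate what "looking at the quantiles" can mean, here is a minimal sketch in base R. It is not the exact procedure used for the vignette: the t-statistics are simulated, and the `getStats()` call mentioned in the comment is only an assumption about how the real values might be pulled out of a bsseq object.

```r
# Simulated stand-in for genome-wide t-statistics (hypothetical data).
# Assumption: with bsseq, the real values could be extracted with something
# like getStats(BS.cancer.ex.tstat)[, "tstat.corrected"].
set.seed(1)
tstat <- c(rnorm(99000), rnorm(500, mean = 6), rnorm(500, mean = -6))

# Symmetric quantile cutoffs: 2.5% of loci in each tail.
q_cut <- quantile(tstat, probs = c(0.025, 0.975), na.rm = TRUE)

# Conversely, the fraction of loci that a fixed cutoff of +/- 4.6 flags.
frac_beyond <- mean(abs(tstat) > 4.6, na.rm = TRUE)

print(q_cut)
print(frac_beyond)
```

A quantile cutoff flags the same share of loci (here 5%) whatever the data look like, which is the implicit assumption described above. A fixed cutoff flags only as many loci as actually exceed it.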

The best way to do this is to pair the choice of cutoff with a permutation analysis in which you permute the sample labels. For most sample sizes in WGBS this sounds a bit insane, but we have done it successfully. For example, in our paper on EBV in Genome Research we use a permutation approach even though we have only 3 samples in each group. In that case the signal is so strong that, for almost all the DMRs, we do not find a better DMR in any permutation.
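The idea of permuting sample labels can be sketched on a toy example. This is not the procedure from the paper: the per-sample values are made up, the helper `t_for` is hypothetical, and a real analysis would rerun the whole DMR-finding step for each relabelling rather than a single t-test.

```r
# Hypothetical data: one summary value per sample (e.g. mean methylation in a
# candidate region), three samples per group.
values <- c(0.82, 0.79, 0.85, 0.31, 0.35, 0.28)
n <- length(values)

# Pooled-variance two-sample t-statistic for a given set of "group 1" indices.
t_for <- function(idx) {
  unname(t.test(values[idx], values[-idx], var.equal = TRUE)$statistic)
}

# Statistic under the true labelling (samples 1-3 versus samples 4-6).
observed <- t_for(1:3)

# Every way of assigning 3 of the 6 samples to group 1: choose(6, 3) = 20.
relabelings <- combn(n, 3)
perm_t <- apply(relabelings, 2, t_for)

# Number of relabellings at least as extreme as the observed one. The true
# labelling and its mirror image (groups swapped) are always among them.
n_extreme <- sum(abs(perm_t) >= abs(observed) - 1e-8)

print(n_extreme / ncol(relabelings))
```

With 3 samples per group there are only 20 relabellings, so the smallest proportion this scheme can return is 2/20 = 0.1. That limit is why the approach only becomes convincing when no permutation produces a region as good as the observed one.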

Thanks for clarifying, Kasper - very helpful. I will check what results I get with both the 2.5/97.5 percentiles and your empirical 4.6 cutoff.

Probably a naive question (as you might have already realised I don't have a stats background) but wouldn't you solve this problem at least in part by selecting DMRs based on their p-value adjusted for multiple testing? In that way you would not make the assumption that a certain percentage of the genome is necessarily differentially methylated and you could use a p-value threshold that, albeit still arbitrary, is easier to explain/justify in most contexts.

Back to Peter's answer below, if I were to compute p-values based on the t-statistics with `pt()`, how would I calculate the degrees of freedom?

I ended up using the 4.6 cutoff as it's considerably more stringent than the 2.5/97.5 percentiles.
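For the `pt()` question above, a minimal sketch under a strong assumption: the statistic behaves like a classical pooled-variance two-sample t-statistic, in which case the degrees of freedom are n1 + n2 - 2. The thread does not confirm that this holds for bsseq's t-statistics. They are computed from smoothed data, so neighbouring loci are correlated and the resulting p-values may not be well calibrated. The t-statistics and group sizes below are made up.

```r
# Hypothetical t-statistics for a handful of loci, three samples per group.
tstat <- c(-5.2, 0.4, 4.6, 1.9, -0.7)
n1 <- 3
n2 <- 3

# Assumption: classical pooled-variance two-sample t-test,
# so the degrees of freedom are n1 + n2 - 2.
df <- n1 + n2 - 2

# Two-sided p-value from the t distribution.
pval <- 2 * pt(abs(tstat), df = df, lower.tail = FALSE)

# Benjamini-Hochberg adjustment for multiple testing.
padj <- p.adjust(pval, method = "BH")

print(data.frame(tstat, pval, padj))
```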
