Question

t-statistic cutoff for finding DMRs with bsseq

0

Entering edit mode

enricoferrero ▴ 660

@enricoferrero-6037

Last seen 2.4 years ago

Switzerland

Hi,

I'm working on a differential methylation analysis on WGBS data, using the bsseq package.

I'm following the vignette to identify differentially methylated regions (DMRs) between two groups but I'm having troubles in understanding this step:

Once t-statistics have been computed, we can compute differentially methylated regions (DMRs) by thresholding the t-statistics. Here we use a cutoff of 4.6, which was chosen by looking at the quantiles of the t-statistics (for the entire genome).
> dmrs0 <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6))

So, my questions are:

How was the 4.6 cutoff for the t-statitistic chosen and what does it mean to look at the quantiles of the t-statistic? Can you provide some code that shows how the choice was made?
Would it be possible to filter by p-value instead? I would find it more intuitive but I don't see a p-value column in the results object. How would I go about adding a p-value column based on the values of the t-statistic?

Thank you!

Enrico

bsseq dmr ttest pvalue • 3.5k views

ADD COMMENT • link updated 8.2 years ago by James W. MacDonald 65k • written 8.2 years ago by enricoferrero ▴ 660

1

Entering edit mode

The 4.6 cutoff was picked based on a quantile of the cancer dataset (using all autosomes). I have since used the 4.6 cutoff routinely for other systems. The problem I have with a quantile cutoff is that you're implicitely assuming that a certain percentage of the genome is differentially methylation, which is often an open question. In my experience 4.6 gives you good looking DMRs and decreasing this cutoff substantially just gives a lot of bad looking regions. In systems with few differences I don't find many DMRs with a 4.6 cutoff, but that is because there is not much difference between the two groups, not because the cutoff is too stringent.

The best way to do this is to pair the choice of cutoff with a permutation analysis where you permute the sample labels. For most samples sizes using WGBS this sounds a bit insane, but we have done this successfully. For example, in our paper on EBV in Genome Research we do use a permutation approach despite the fact that we only have 3 samples in each group. But in this case the signal is so strong that we for almost all the DMRs, we do not find any DMR in any permutation which is better.

ADD REPLY • link 8.2 years ago Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

Thanks for clarifying Kasper - very helpful. I will check what results I get with both the 2.5/97.5 percentiles and your empirical 4.6 cutoff.

Probably a naive question (as you might have already realised I don't have a stats background) but wouldn't you solve this problem at least in part by selecting DMRs based on their p-value adjusted for multiple testing? In that way you would not make the assumption that a certain percentage of the genome is necessarily differentially methylated and you could use a p-value threshold that, albeit still arbitrary, is easier to explain/justify in most contexts.

Back to Peter's answer below, if I were to compute p-values based on the t-statistics with pt(), how would I calculate the degrees of freedom?

ADD REPLY • link 8.2 years ago enricoferrero ▴ 660

0

Entering edit mode

@enricoferrero: I'm interested in the results you found. Any significant difference in performance?

ADD REPLY • link 8.0 years ago Stephen H • 0

0

Entering edit mode

I ended up using the 4.6 cutoff as it's considerably more stringent than the 2.5/97.5 percentiles.

ADD REPLY • link 8.0 years ago enricoferrero ▴ 660

score 2 · Accepted Answer · 2016-02-12

2

Entering edit mode

Peter Hickey ▴ 740

@petehaitch

Last seen 4 hours ago

WEHI, Melbourne, Australia

Hi Enrico,

The cutoff is chosen as follows: Compute the t-statistic for each CpG in your study and then form the distribution of these t-statistics. Compute the (alpha/2)-level lower and upper quantiles of this distribution; this can be done using the quantile() function in R. The choice of alpha is up to you - from memory, the numbers referred to in the vignette correspond to alpha = 0.05, i.e., the lower quantile is at the lower 2.5% of the data and the upper quantile as at the upper 97.5% of the data. This means we are selecting empirically the 5% of the most extreme t-statistics in your data. To emphasise, these numbers should be based on the distribution of t-statistics from your data.
Sure, you could filter by p-value. You can easily map the t-statistic to a p-value; see the pt() function in R. However, you will need to also compute and supply the appropriate degrees of freedom as well as accounting for the sign of the t-statistic. To me, this makes it simpler to just use the t-statistics themselves for setting the cutoff.

I hope this helps.

By the way, there's no need to notify the bsseq developers via Twitter of your question on this forum. We receive email notifications of posts tagged by packages we are involved with.

Cheers,

Pete

ADD COMMENT • link 8.2 years ago Peter Hickey ▴ 740

1

Entering edit mode

WRT to Twitter mentions: It's good to know that you get notified through this forum, but I think that tweeting about it can stimulate engagement with the wider rstats/bioinformatics community, which is the primary reason why I do that. Having said that, if you found the tweet mention annoying I apologise.

Thanks again for your help!

Enrico

ADD REPLY • link 8.2 years ago enricoferrero ▴ 660

0

Entering edit mode

Hi Peter,

Many thanks for the explanation - it is clearer now.

It might be worth adding some extra info to the vignette as it's not immediately obvious how the cutoff is calculated and with what percentiles. From what I can see, 4.6 does not correspond to specific percentiles on the vignette data.

Cheers,

Enrico

ADD REPLY • link 8.2 years ago enricoferrero ▴ 660

1

Entering edit mode

I've logged an issue to do just that https://github.com/kasperdanielhansen/bsseq/issues/12. Thanks, Enrico, I'll look into it before the next release.

ADD REPLY • link 8.2 years ago Peter Hickey ▴ 740

score 2 · Accepted Answer · 2016-02-12

Edit: ~~Technically the quantiles are the (0,25,50,75,100) percentiles of the data~~ <- That is like, not correct.

Quantiles can be computed using the quantile function. Using the data from the vignette:

> library(bsseqData)

> data(BS.cancer.ex.tstat)

> ps <- getStats(BS.cancer.ex.tstat)[,"tstat.corrected"]

> quantile(ps)
         0%         25%         50%         75%        100%
-10.9073281  -0.5864714   0.1185087   0.8609352  14.3986699

But I doubt that you want to use any of those percentiles as a cutoff. More likely you want to look at the (95, 99, 99.5) percentiles instead.

> quantile(ps, c(0,0.005, 0.01, 0.05, 0.95,0.99, 0.995,1))
        0%       0.5%         1%         5%        95%        99%      99.5%
-10.907328  -4.115913  -3.447037  -1.998064   2.622552   5.918016   8.127983
      100%
 14.398670

And you could exclude all but about 2% of the regions under contention by using the cutoffs that are used in the vignette

> sum(ps < -4.6 | ps > 4.6)/length(ps)
[1] 0.01981683

You could use the p-value for those t-statistics, but it wouldn't be materially different than using the t-statistics directly.

The idea here is pretty simple - you just want to get a relatively small number of CpG sites that look to be differentially methylated, and then see if any of them are all together in one or more regions. How you select the CpG sites will necessarily be pretty ad hoc, and you just want to make sure you don't have too many (don't want too many false positives creeping in) or too few (you do want some results, no?). In the end you need some criterion to say if a CpG is in or out, and using the t-stat or the p-value will result in pretty much the same thing.

To see what I mean, you can make a 'volcano plot' of the data:

> plot(ps, -log10(pt(abs(ps), 8, lower.tail = FALSE)))

> abline(v = c(-4.6, 4.6))

The cutoff in the vignette is choosing all the CpG sites outside of those two vertical lines. Using a p-value cutoff would be the same as putting a horizontal line in there somewhere, at say

> abline(h = 3.1)

And taking all the CpG sites above that line. Which just so happens to be the same exact CpG sites.