Question: Odd P-value distribution in edgeR on RRBS data
0
5 months ago by
ingerslev0 wrote:

I have been running an analysis of RRBS data using edgeR, following the guide in the edgeR manual. The metadata looks like this:

 File Sample Condition S1_Pre.cov.gz S1 Pre S1_Post.cov.gz S1 Post ... ... ... S9_Pre.cov.gz S9 Pre S9_Post.cov.gz S9 Post

Everything has been done as in the manual, with the exception of the design matrix and contrast which is:

designSL <- model.matrix(~0+Condition + Sample, data=targets)
design <- modelMatrixMeth(designSL)

contr <- makeContrasts(Condition = ConditionPost - ConditionPre, levels=design)

When I plot a histogram of the raw P-values it looks like so:

Two things strike me as odd, the high number of sites with a P-value close to 1, and the overrepresentation of sites with a P-value around 0.2.

The sites with a P-value close to 1 are (almost) 100% or 0% methylated in all samples (they have 0 counts of either methylated or unmethylated C's in almost all samples). I don't know what characterizes the sites with a P-value around 0.2.

I have tried removing sites with a constant methylation level, but that does not solve the issue (most site have at least one observation in at least one sample)

My questions are:

Is it reasonable/safe to remove all sites that have close to 0 or 100% methylation in all samples? What would be a reasonable heuristic?

What can cause a bump around P = 0.2? Can this be remedied somehow?

edger rrbs • 291 views
modified 5 months ago by Gordon Smyth37k • written 5 months ago by ingerslev0

I just realised that filtering out all sites where the average number of methylated or unmethylated reads across all samples is less than one seems to solve both problems. Now the distribution of P-values looks uniform with a slight increase in sites with a low P-value. If anyone has an explanation, suggestions or guidelines you are welcome to post them, but I think this solves the problem.

Have your followed the edgeR User's Guide regarding filtering, as in Section 4.7.3? Or else the "Filter to remove low counts" section of  https://f1000research.com/articles/6-2055/ ?

I used the filtering as in Section 4.7.3 when I wrote this question, I have now also tested the filtering used in f1000 paper and also the filterByExpr function (with the design matrix) of edgeR. The edgeR guide leaves me with 150.000 sites, f1000 with 260.000 and filterByExpr with 592.000 sites. In comparison filtering out sites with almost 100% or 0% methylation removes 10.2 million sites (the histogram was on a subset of sites).

Is there any reason not to use filterByExpr?

Answer: Odd P-value distribution in edgeR on RRBS data
4
5 months ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:

How to filter methylation data

Filtering for methylation data is different from RNA-seq. The main requirement is that each CpG site should have decent read coverage, otherwise it is impossible to tell whether the site is methylated or not. In the case study in the edgeR User's Guide (Section 4.7) we required that a CpG site should have at least 10 reads (methylated or unmethylated) in every sample. In version 2 of our F1000 article https://f1000research.com/articles/6-2055/ we relaxed this slightly to at least 8 reads in every sample. You have already applied this filtering, so your analysis should be fine.

The edgeR filterByExpr() function is intended for RNA-seq and is not appropriate for methylation data. For methylation data we require that a CpG has minimal coverage for all samples whereas for RNA-seq data a gene needs to expressed only in some samples.

Now I come to your main question. What if a CpG site has plenty of reads but they are always 100% methylated or 0% methylated. Such a CpG is uninteresting from a differential methylation (DM) point of view. It can never be DM. It will have logFC=0 and PValue=1 for all comparisons. Can you filter it?

After some thought, I think that, yes, you can filter CpG sites with 100% or 0% methylation across all samples. It is true that this filtering is dependent on the methylated vs unmethylated factor in the design matrix. Such "peeking" would not be allowed for RNA-seq. However, for methylation data this factor is a structural variable rather than an experimental factor that you will test for. Filtering these sites should not bias the results for the experimental factors, all of which are essentially orthogonal to the methylation factor.

CpG sites with 100% or 0% methylation will be fitted perfectly by edgeR, with fitted values equal to observed counts, and the deviances will be zero. These sites will not contribute to the dispersion estimation so removing them will not bias the empirical Bayes estimation.

However, removing the 100% or 0% methylated sites may make little different to you final results. There is no objective reason why you need to have a uniform p-value histogram. The only gains will be (i) decrease is running time and (ii) a slight increase in statistical power for detecting DM at the remaining sites.

This answers my question, thank you. Can I ask two more questions? Firstly, why is filterByExpr not appropriate, if it is used on the Coverage object from the f1000 paper? The function seems to be doing almost the same. Secondly, how is the model affected by removing sites that are 0% or 100% across almost all samples. My tissue is very homogenous, and 94% of all sites have an average of less than 1 counts across all samples in either the methylated or the unmethylated state, whereas only 50% have exactly 0 counts. This amount to testing either 5.5 mio. sites or a few hundred thousand sites. The decrease in running time is substantial, and to me it seems the power increase should be big as well - but am I somehow biasing the results by excluding these sites?

No, filterByExpr isn't doing the same thing. For methylation data, we need CpGs to have minimal coverage for every sample, otherwise the methylation level isn't even measurable. For RNA-seq data we only need a gene to be expressed in a few samples to be worth keeping.

The CpG numbers you mention seem enormous. Half of all CpGs have no methylation at all yet you still have 5.5 million CpG sites left after filtering for both coverage >= 10 and methylation not all zero? You must have started with a lot of CpG sites. Is this really RRBS data?

I suppose you could filter CpGs for which the total number of methylated counts was very small, say less than 5. I have not tried this myself and I can't tell you confidently it will be fine if you filter more than that.

I will play it safe and stick to your initial suggestion and only filter out sites with 100% or 0% methylation. The data is RRBS, but it is sequenced quite deeply, maybe more than necessary.

Your explanation of filterByExpr made it make sense to me - thanks.

Of course, if you follow our promoter region approach, then it all becomes much simpler.

Answer: Odd P-value distribution in edgeR on RRBS data
1
5 months ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

If you have a site with 100% or 0% methylation across all samples, then getting a p-value of 1 is inevitable; after all, there is literally no difference in methylation status between the conditions. You can also get p-values of 1 if all sites in one condition have no reads for both the methylated and unmethylated states. The log-ratio of methylated:unmethylated is undefined and thus there is no evidence to reject the null hypothesis. So I don't think there's any mathematical error here - that's just the way it is.

One could argue that you should have done a better job of filtering out such sites, but I don't see an easy way of identifying them in a manner that is independent from the model fit. Your proposed solution (removing all-methylated or all-unmethylated sites) indirectly "peeks" at the difference between conditions; filtering of sites on the basis of this information subverts the multiple testing correction and will probably result in more false positives. All-methylated or all-unmethylated sites occur with non-zero probability under the null hypothesis, so removing them will effectively increase the type I error rate across all remaining tests.

In short, I wouldn't worry about it. You can think of the spike at 1 as a discrete approximation to a uniform distribution (hah!). If we were able to get precise continuous values to replace the zeroes, we'd probably have a nice uniform distribution, but because our observations are rounded off to the nearest integer, we're stuck with p-values of 1 for these sites. I wouldn't worry too much about the spike at 0.2 either. Unless you know that the null hypothesis is true for all sites, you can't expect the p-value distribution to be perfectly uniform.

Edit: Given your update, I should mention that the average count is a good approximation to (what is probably) an independent filter statistic when the library sizes are equal. Getting rid of low-abundance sites is not guaranteed to solve the spike-at-1 problem (e.g., what about strongly methylated/unmethylated sites at high coverage?) but it will reduce the number of sites with many zeroes, so that's a start. The disappearance of the bump is harder to explain but I'm not convinced that was really a problem in the first place.

Answer: Odd P-value distribution in edgeR on RRBS data
0
5 months ago by
United States
Peter Langfelder2.1k wrote:

I'll go against Aaron's advice a bit. You can filter by variance (that's not peeking at the differences between conditions), or, maybe better in this case, by median absolute deviation (MAD, R function mad). If you have a variable (site) that is constant except for a few (less than 25%) of samples, mad will be zero. To the best of my knowledge, filtering out such sites will not bias the significance statistics.

In my opinion, variance filtering is always incorrect in conjunction with edgeR, limma or DESeq2. It is true that variance filtering is independent of the explanatory variable, but it biases the variance distribution and hence stuffs up the empirical Bayes dispersion estimation that those packages do and hence does interfere with the statistical testing procedures.

A simple example to back up Gordon's point:

library(edgeR)

group <- gl(2,4)
design <- model.matrix(~group)

set.seed(999)
disp <- 0.1
y <- matrix(rnbinom(80000, mu=100, size=1/disp), ncol=nrow(design), byrow=TRUE)

d <- DGEList(y)
d <- estimateDisp(d, design)
fit <- glmQLFit(d, design, robust=TRUE)
res <- glmQLFTest(fit, coef="group2")

par(mfrow=c(1,2))
hist(res$table$PValue) # Looks uniform.
mean(res$table$PValue <= 0.1) # Controlled below threshold.
## [1] 0.0956

# Filtering for low-deviance sites, analogous to low variance.
keep <- fit$deviance < median(fit$deviance)

# Repeating empirical Bayes with filtered sites:
fit2 <- glmQLFit(d[keep,], design, robust=TRUE)
res2 <- glmQLFTest(fit2, coef="group2")
hist(res2$table$PValue) # Not uniform!
mean(res2$table$PValue <= 0.1) # Above the threshold.
​## [1] 0.1892

You'll notice that I used the deviance instead of the variance, as the former is the analogous metric for GLMs. Computing the variance of the counts doesn't make sense; the linear model would be identity-link rather than log-link, for starters. Looking for NB dispersions near zero isn't helpful either, as either the dispersions are squeezed away from zero (if prior.df!=0) or many useful features get discarded due to imprecise estimation of the dispersions with low replication (if prior.df=0).

In the OP's specific case, to get rid of the problematic sites in the most specific and comprehensive manner, I would fit the null model to each site and discard all sites that had deviances of zero under the null. However, I would rather not do this in an actual analysis. I suspect this is equivalent to data dredging, though loss of type I error control is difficult to demonstrate; in scenarios where these problematic sites are generated, the p-value distributions are usually quite screwed up in other respects.

I am not going to disagree with Gordon since he knows infinitely more than I do about the internal workings of the empirical Bayes moderation, but I have a comment on Aaron's code.

When filtering by deviance, one needs to calculate the deviance from the fit to design ~1 (which is analogous to calculating variance of the original variable), not ~group (that's biased since you are looking at the response and would be analogous to looking at the variance of the residuals). I modified Aaron's code to filter by deviance obtained from a null model (intercept only), and there is no obvious p-value inflation.

library(edgeR)

group <- gl(2,4)
design <- model.matrix(~group)
design.null = model.matrix(~1, data = data.frame(x = group))

set.seed(999)
disp <- 0.1
y <- matrix(rnbinom(80000, mu=100, size=1/disp), ncol=nrow(design), byrow=TRUE)

d <- DGEList(y)
d <- estimateDisp(d, design)
fit <- glmQLFit(d, design, robust=TRUE)
res <- glmQLFTest(fit, coef="group2")

d.null = estimateDisp(d, design.null)
fit.null <- glmQLFit(d.null, design.null, robust=TRUE)

par(mfrow=c(1,2))
hist(res$table$PValue) # Looks uniform.
mean(res$table$PValue <= 0.1) # Controlled below threshold.
## [1] 0.0956

# Filtering for low-deviance sites, analogous to low variance.
keep <- fit.null$deviance < median(fit$deviance)

# Repeating empirical Bayes with filtered sites:
fit2 <- glmQLFit(d[keep,], design, robust=TRUE)
res2 <- glmQLFTest(fit2, coef="group2")
hist(res2$table$PValue)
mean(res2$table$PValue <= 0.1)

## [1] 0.07942708

Perhaps the problems may become clearer if you drop the type I error threshold:

library(edgeR)

group <- gl(2,4)
design <- model.matrix(~group)

set.seed(999)
disp <- 0.1
y <- matrix(rnbinom(400000, mu=100, size=1/disp), ncol=nrow(design), byrow=TRUE)
# Increased the number of iterations for stable type I error rate estimates.

d <- DGEList(y)
d <- estimateDisp(d, design)
fit <- glmQLFit(d, design, robust=TRUE)
res <- glmQLFTest(fit, coef="group2")

par(mfrow=c(1,2))
hist(res$table$PValue) # Looks uniform.
mean(res$table$PValue <= 0.01) # Controlled at threshold.
## [1] 0.00988

# Following Peter's suggestion; slightly corrected median() call.
design.null <- design[,1,drop=FALSE]
d.null <- estimateDisp(d, design.null)
fit.null <- glmQLFit(d.null, design.null, robust=TRUE)
keep <- fit.null$deviance < median(fit.null$deviance)

# Repeating empirical Bayes with filtered sites:
fit2 <- glmQLFit(d[keep,], design, robust=TRUE)
res2 <- glmQLFTest(fit2, coef="group2")
hist(res2$table$PValue) # Not uniform; see the far left!
mean(res2$table$PValue <= 0.01) # ~8-fold below the threshold.
## [1] 0.00122


So the opposite problem now occurs where you get a depletion of low p-values. Not surprising, as you're removing features where the deviance under the null model is high - that is, those features that are likely to be differential and thus of interest in the first place! Moral of the story: any filtering on the GLM deviance is dangerous.