I have been running an analysis of RRBS data using edgeR, following the guide in the edgeR manual. The metadata looks like this:
Everything was done as in the manual, except for the design matrix and the contrast, which are:
designSL <- model.matrix(~0 + Condition + Sample, data = targets)
design <- modelMatrixMeth(designSL)
contr <- makeContrasts(Condition = ConditionPost - ConditionPre, levels = design)
When I plot a histogram of the raw P-values, it looks like this:
Two things strike me as odd: the high number of sites with a P-value close to 1, and the overrepresentation of sites with a P-value around 0.2.
The sites with a P-value close to 1 are (almost) 100% or 0% methylated in all samples (they have 0 counts of either methylated or unmethylated C's in almost all samples). I don't know what characterizes the sites with a P-value around 0.2.
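One way I could look into this is to bin the sites by P-value and summarise their read counts per bin. The following is only a minimal sketch on simulated data: the data frame `diag_df` and its columns `pvalue`, `coverage` and `minor` are hypothetical stand-ins for values that would be extracted from the real results, not objects from my analysis.

```r
# Simulated stand-in data; in practice these columns would come from the
# test results and the count matrices (hypothetical names throughout).
set.seed(1)
n_sites <- 1000
diag_df <- data.frame(
  pvalue   = runif(n_sites),            # raw P-value per site
  coverage = rpois(n_sites, 40),        # total reads per site over all samples
  minor    = rpois(n_sites, 5)          # min(total methylated, total unmethylated)
)

# Assign each site to a P-value bin of width 0.05.
diag_df$bin <- cut(diag_df$pvalue,
                   breaks = seq(0, 1, by = 0.05),
                   include.lowest = TRUE)

# Median coverage and median minor count per bin; a bin that stands out
# (e.g. the one containing 0.2) may point to what characterizes those sites.
median_coverage <- tapply(diag_df$coverage, diag_df$bin, median)
median_minor    <- tapply(diag_df$minor, diag_df$bin, median)

print(cbind(median_coverage, median_minor))
```

If the sites near 0.2 turned out to have unusually low coverage or very few reads of one type, that would at least suggest where to look.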
I have tried removing sites with a constant methylation level, but that does not solve the issue (most sites have at least one observation in at least one sample).
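To make the kind of filter I mean concrete, here is a minimal sketch in base R on invented count matrices. The names `Me` and `Un` (methylated and unmethylated counts, sites by samples) and the threshold `min_samples` are assumptions for illustration, not the exact code I ran.

```r
# Toy methylated (Me) and unmethylated (Un) count matrices: sites x samples.
# All values are invented purely for illustration.
Me <- matrix(c(0, 0, 0,  0,
               5, 3, 0,  7,
               9, 8, 9, 10,
               4, 0, 6,  2),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("site", 1:4), paste0("S", 1:4)))
Un <- matrix(c(10, 12, 9, 11,
                4,  6, 8,  2,
                0,  0, 0,  0,
                5,  9, 3,  6),
             nrow = 4, byrow = TRUE, dimnames = dimnames(Me))

# Strict filter: drop sites that are never or always methylated,
# i.e. sites with zero methylated or zero unmethylated reads overall.
has_both <- rowSums(Me) > 0 & rowSums(Un) > 0

# Looser heuristic (hypothetical threshold): require both methylated and
# unmethylated reads within the same sample in at least `min_samples` samples.
min_samples <- 2
informative <- rowSums(Me > 0 & Un > 0) >= min_samples

keep <- has_both & informative
print(names(which(keep)))
```

The strict filter only removes sites with no reads of one type at all, which is why a single stray read in one sample is enough for a site to pass it; the second condition is one possible way to tighten that.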
My questions are:
Is it reasonable/safe to remove all sites that have close to 0 or 100% methylation in all samples? What would be a reasonable heuristic?
What can cause a bump around P = 0.2? Can this be remedied somehow?