Question

Bounded Regression with Limma for RRBS?

0

Entering edit mode

nhejazi • 0

@nhejazi-10825

Last seen 4.7 years ago

UC Berkeley

Main Question:

I have been working on an analysis of RRBS data, using various utility function to generate a data matrix that consists of averages of proportions of methylated reads at a list of CpG islands for each of the n = 59 subjects (the dimension of the matrix is roughly 15,000 x 59). The `limma::lmFit` method (and related functions) are then used, along with an appropriate design matrix, to assess any evidence of differential methylation at these CpG islands. I am concerned that the bounded nature of the outcome variable (linear regression is of course unsuitable for bounded outcomes) is preventing any potential signal in the data from being picked up. Does Limma accommodate such analyses, or are there transformations of the data that may help?

Background:

In working on an analysis of RRBS data, I have performed preprocessing on BAM files for n = 59 subjects using Bismark. I have then imported the coverage files produced by Bismark into R using the `bsseq::read.bismark` function to generate an object of class BSseq. I then extract the coverage and count of methylated reads from the BSseq object, generating a data structure of dimension ~10e6 x 59. After dropping CpG loci with insufficient coverage, a single data.frame is generated that contains the proportion of methylated reads at each CpG locus -- the new matrix is roughly 3,000,000 x 59. A table of CpG islands is then extract from the UCSC genome browser and the methylation proportion values for CpG sites falling into the same defined island region are averaged. This summarized data structure is then analyzed with `limma::lmFit` and various associated functions (`eBayes`, `topTable`, etc.), with an appropriate design matrix. The result is a table of CpG islands assessed for any evidence of differential methylation. It is clear that linear modeling is not a suitable method for assessing bounded outcomes (the averaged proportions are of course in [0, 1]). Are there suggestions on how to properly transform the data to deal with this problem? I have seen general approaches to bounded regression but nothing specific to Limma and related bioinformatical tools.

Any help/advice would be much appreciated.

limma rrbs bsseq • 1.4k views

ADD COMMENT • link updated 7.9 years ago by Aaron Lun ★ 28k • written 7.9 years ago by nhejazi • 0

0

Entering edit mode

I can't answer your question fully as I don't have experience with methyl-seq, but for methylation arrays my understanding is that one typically runs limma on the M-values, which are the result of a logistic transformation of the methylation proportions. This handles the issue that proportions are bounded between 0 and 1 and decidedly non-normal as a result, but it doesn't handle the discreteness that exists in methyl-seq but not methylation array data. I don't know how you would handle that.

ADD REPLY • link 7.9 years ago Ryan C. Thompson ★ 7.9k

score 4 · Answer 1 · 2016-06-15

I don't analyze a lot of methylation data either, but an alternative approach is to run edgeR on the raw counts for each CpG island. This would solve the discreteness issues that Ryan mentioned. For simplicity, let's consider a single island. Each patient has two counts for this island, one for methylated reads and another for unmethylated reads. Further assume that patients are split into two groups. We can now fit a generalized linear model to these counts, blocking on the patient and including methylation:group interaction terms. Each interaction term represents the log-fold change between methylated and unmethylated counts for one group; we can compare them against each other to test for differences in methylation between groups. More succinctly:

patient <- factor(rep(1:10, each=2))
methtype <- factor(rep(c("m", "u"), 10))
grouping <- factor(rep(c("A", "B"), each=10))
design <- model.matrix(~0 + patient + methtype:grouping)
design <- design[,-grep("methtypeu", colnames(design))] # get to full rank

Looking at colnames(design) gives us:

 [1] "patient1"            "patient2"            "patient3"           
 [4] "patient4"            "patient5"            "patient6"           
 [7] "patient7"            "patient8"            "patient9"           
[10] "patient10"           "methtypem:groupingA" "methtypem:groupingB"

The first 10 terms are patient blocking factors that aren't interesting. The last two terms represent the log-fold change of the methylated count over the unmethylated count, averaged across all patients in the named group. To test for differences in methylation between groups, we can just set our glmLRT (or glmQLFTest, for more limma-like behaviour) contrast to:

con <- c(integer(10), 1, -1)

... which tests whether the last two coefficients have the same value. This approach can probably be generalized to arbitrarily complicated patient structures and groupings. The trick is to imagine the log-fold change between the methylated/unmethylated count within each patient as the value of interest, and then work back from there to construct the full design matrix for the counts.

I haven't tested this strategy on real BS-seq data, so I don't know how well it'll work, but it might be worth a shot.