I have been following the paper "Differential methylation analysis of reduced representation bisulfite sequencing experiments using edgeR" and I have some problems with the low count filtering process.
Here is the experimental design I have:
meta <- data.frame(library = as.factor(1:18), patient = as.factor(c(rep(1:3, each = 4), rep(4:6, each = 2))), disease = c(rep("sick", each = 12), rep("healthy", each = 6)), tissue = c(rep(c("liver","muscle"), times = 6), rep(c("liver","muscle"), times = 3)), treatment = c(rep(c("placebo","drug"), each = 6), rep("placebo", each = 6)) ) meta$group <- paste(meta$disease, meta$tissue, meta$treatment, sep = ".")
I know that it is kind of a weird setup as the
healthy patients didn't receive a "drug" as a
treatment. That's why I intended to use the approach with
voomLmFit I asked earlier (post here).
PS: I tried to keep the experimental setup as simple as possible for easy communication. I have more more than 10 patients with the total number of 44 libraries.
When I do the recommended filtering for the CpG sites, I end up with so few reads.
## The following code is similar to the one on the RRBS edgeR paper ## (...) Coverage <- yall$counts[, Methylation=="Me"] + yall$counts[, Methylation=="Un"] min.coverage <- 8 nObs <- ncol(Coverage) keep <- rowSums(Coverage >= min.coverage) == nObs table(keep) # FALSE TRUE # 8023559 1042 ## (...)
I have only 1042 reads, which have at least 8 counts across all the samples, out of ~8 million.
I would like to get your feedback about how to filter the low counts in such a dataset. As the edgeR paper on RRBS differential analysis mentions, the filtering process can be somewhat relaxed, but I am not sure how. Therefore, I tried some of the following approaches:
Approach #1: filterByExpr()
I am aware of the fact that Gordon Smyth did not recommend the usage of
filterByExpr on RRBS data (post here) as the function was originally designed for RNAseq filtering. I gave it a try anyways as I was running out of options and here are the results:
keep <- filterByExpr(Coverage, group = meta$group) table(keep) # FALSE TRUE # 5179460 2845141
Approach #2: Keep at least "n" samples that has enough coverage in every group
Let's assume that we would like to keep CpGs that has 8 coverage in at least 2 samples per group.
min.coverage <- 8 # minimum coverage for a CpG site pass <- (Coverage >= min.coverage) * 1 # convert logical to numeric pass_per_group <- t(rowsum(t(pass), group = meta$group)) min.n <- 2 # minimum number of samples with enough coverage keep <- rowSums(pass_per_group >= min.n) == length(unique(meta$group)) table(keep) # keep # FALSE TRUE # 7998435 26166
Unfortunately, it is still not high enough coverage. Besides, I am not sure how such filtering might affect the differential analysis.
Question: I believe that the approaches above are not ideal. May I ask for your help if you have a better way of filtering the low RRBS counts? Or should I live with 1024 CpG sites after recommended filtering.
Thanks in advance for your help.