Filtering low coverage RRBS counts results in few reads. Any alternatives?
2
0
Entering edit mode
@altintasali-11054
Last seen 12 weeks ago
Denmark

I have been following the paper "Differential methylation analysis of reduced representation bisulfite sequencing experiments using edgeR" and I have some problems with the low count filtering process.

# Experimental design

Here is the experimental design I have:

meta <- data.frame(library = as.factor(1:18),
patient = as.factor(c(rep(1:3, each = 4), rep(4:6, each = 2))),
disease = c(rep("sick", each = 12), rep("healthy", each = 6)),
tissue = c(rep(c("liver","muscle"), times = 6), rep(c("liver","muscle"), times = 3)),
treatment = c(rep(c("placebo","drug"), each = 6), rep("placebo", each = 6))
)
meta$group <- paste(meta$disease, meta$tissue, meta$treatment, sep = ".")


I know that it is kind of a weird setup as the healthy patients didn't receive a "drug" as a treatment. That's why I intended to use the approach with voomLmFit I asked earlier (post here).

PS: I tried to keep the experimental setup as simple as possible for easy communication. I have more more than 10 patients with the total number of 44 libraries.

# Problem: Filtering

When I do the recommended filtering for the CpG sites, I end up with so few reads.

## The following code is similar to the one on the RRBS edgeR paper
## (...)
Coverage <- yall$counts[, Methylation=="Me"] + yall$counts[, Methylation=="Un"]
min.coverage <- 8
nObs <- ncol(Coverage)
keep <- rowSums(Coverage >= min.coverage) == nObs
table(keep)
#   FALSE    TRUE
# 8023559    1042

## (...)


I have only 1042 reads, which have at least 8 counts across all the samples, out of ~8 million.

# Question

I would like to get your feedback about how to filter the low counts in such a dataset. As the edgeR paper on RRBS differential analysis mentions, the filtering process can be somewhat relaxed, but I am not sure how. Therefore, I tried some of the following approaches:

Approach #1: filterByExpr()

I am aware of the fact that Gordon Smyth did not recommend the usage of filterByExpr on RRBS data (post here) as the function was originally designed for RNAseq filtering. I gave it a try anyways as I was running out of options and here are the results:

keep <- filterByExpr(Coverage, group = meta$group) table(keep) # FALSE TRUE # 5179460 2845141  Approach #2: Keep at least "n" samples that has enough coverage in every group Let's assume that we would like to keep CpGs that has 8 coverage in at least 2 samples per group. min.coverage <- 8 # minimum coverage for a CpG site pass <- (Coverage >= min.coverage) * 1 # convert logical to numeric pass_per_group <- t(rowsum(t(pass), group = meta$group))
min.n <- 2                             # minimum number of samples with enough coverage
keep <- rowSums(pass_per_group >= min.n) == length(unique(meta\$group))
table(keep)
# keep
#   FALSE    TRUE
# 7998435   26166


Unfortunately, it is still not high enough coverage. Besides, I am not sure how such filtering might affect the differential analysis.

Question: I believe that the approaches above are not ideal. May I ask for your help if you have a better way of filtering the low RRBS counts? Or should I live with 1024 CpG sites after recommended filtering.

limma edgeR • 231 views
0
Entering edit mode
@gordon-smyth
Last seen 2 hours ago
WEHI, Melbourne, Australia

Your data seems to have very low coverage, and there is no way for a filtering step to solve that. Your approach #2 seems reasonable. You could lower the min.coverage further without any negative consequences but you probably won't end up with many significant differential CpGs. There's no way to change that -- it's an inevitable consequence of low coverage.

You could consider consolidating CpG counts by genomic region to get larger counts.

0
Entering edit mode
@franceschinigianmarco-16472
Last seen 5 months ago
Switzerland

Have you considered smoothing approaches? Maybe BSmooth or similar approaches could be helpful in your setting ( see https://genomebiology.biomedcentral.com/articles/10.1186/gb-2012-13-10-r83 ). The main idea is to exploit the strong local correlation of CpGs to retain more information, without using strong filters on the coverage. Overlapping CpGs with high coverage from multiple experiments often results in that culprit. I hope this could be useful!