8 months ago by

Cambridge, United Kingdom

To detour a bit from your actual questions: you don't mention exactly what you've already done,
but peak-calling and differential analysis is surprisingly difficult to get right. This is because
peak calling is already a filtering step, and if it's not an independent filter, then you'll have lost
type I error control before you even start using *edgeR* (or *DESeq2*, or other count-based modelling
approaches). The statistical arguments are subtle, but perhaps a demonstration will suffice:

```
# Using a normal simulation for simplicity's sake.
# Just pretend this is some measure of log-binding intensity.
# Each row is a binding site, each column is a sample.
binding <- matrix(rnorm(80000), ncol=4)
group <- gl(2,2)
# Let's say our peak-calling method only detects peaks at
# a log-intensity above 1.
is.peak <- binding >= 1
# Let's say we only keep peaks that are detected in 2+ libraries.
keep <- rowSums(is.peak) >= 2
# Now let's do a t-test on the remaining peaks,
# comparing the first and last two libraries.
p <- apply(binding[keep,], 1, FUN=function(x) t.test(x[1:2], x[3:4])$p.value)
hist(p)
```

Woah! That's not uniform, even though the null hypothesis is true for all binding sites.
The reason is because you're implicitly selecting for sites that
are either (i) differential between conditions, by enriching for peaks where high binding is
present in two libraries of the same condition; or (ii) highly variable within conditions, by
enriching for peaks with high binding in two libraries of different conditions.

Normally, the exact filter strategy does not matter much when you're using *edgeR* for RNA-seq, as
the density of genes around the filter boundary is low, i.e., changing the filtering strategy
will not have a major effect on the identity of the retained genes. However, this is not true
for ChIP-seq data, where the frequency of binding sites decreases with increasing binding intensity.
This means that the density of sites around the filter boundary is always relatively high compared to
the total number of retained sites, such that the choice of filtering strategy will have a larger
effect on the type I error rate amongst the retained peaks.

The solution is to do peak calling independently. This involves pooling all libraries together,
with equal contributions from each library, and then calling peaks on the pooled library.
This ensures that the only information
available to the peak caller is the total read count, with no possibility of looking at
differences between libraries. A related approach is described in the *csaw* user's guide,
where we use `aveLogCPM`

to perform various flavours of filtering; this provides an
independent filter statistic (probably, it's hard to prove but it seems to work well)
for negative binomial models. **Note:** there's no point doing this on your peaks if your
peak calling strategy wasn't independent - the damage has already been done.

Anyway, I think I answered question 2 above. For question 1; the bigger question is whether
you assume that most bound regions are non-DB. This is because most normalization methods
(presumably CQN as well, based on my recollection of the method) assume that most of the input
features are not differential across samples. If you can make that assumption, then yes, I
would filter to get the bound regions and perform normalization on the filtered sites. However,
some applications involve global changes in binding, e.g., drug treatments that suppress histone
methylation, global downregulation of the targeted TF. In such cases, you cannot make the non-DB
assumption, requiring other approaches. I'll guess that you don't have chromatin spike-ins, so
check out the binned TMM method in the *csaw* user's guide.