Question

csaw - workflow to incorporate input/control samples?

1

Entering edit mode

markus.ringner ▴ 10

@markusringner-10600

Last seen 8.0 years ago

We are currently using csaw to investigate differential binding in a set of ChIP-seq experiments. I have some samples for which the control/input samples are highly variable across the genome and I am wondering if someone has experience and insight into the best way to include input-experiments in the csaw differential binding pipeline?

I will try to explain our observations with a simplified example. We are comparing multiple ChIPs (mostly histone modifications) across very homogeneous specific cell types. Consider two cell types A and B, where A is characterized by very high expression of geneA. If I run the "standard" csaw pipeline (with parameters optimized for our data) I get nice
results that we validate with other techniques except for geneA and its flanking genes.

For geneA we get that a mark/TF is more bound in celltype A compared to celltype B, regardless of which mark/TF we look at. The reason is that the background/non-specific signal is much stronger in this region for celltype A. The input (negative control) for celltype A is essentially a square function around gene A and its flanking genes in celltype A (much higher than the surrounding background).

Which approach to include input (negative control libraries) in the csaw analysis pipeline is considered the "best practice"?

csaw chipseq differential binding analysis • 2.8k views

ADD COMMENT • link updated 8.0 years ago by Aaron Lun ★ 28k • written 8.0 years ago by markus.ringner ▴ 10

score 3 · Answer 1 · 2016-04-29

There are three possible approaches to incorporating input data to fix your problem, though only one that I'd actually recommend.

Let's start with the obvious-but-inadvisable approach: subtraction of input counts from ChIP counts for each window. This is the most intuitive strategy, as you get rid of the contribution from background enrichment to get the effect of specific binding for your protein of interest. However, the downstream analysis with edgeR (or DESeq - see discussion at A: DESeq2 for ChIP-seq differential peaks) will no longer be valid. This is because the mean-variance relationship will be totally messed up - the subtraction will effectively shuffle points along the x-axis of the BCV plot, breaking the nice decreasing trend and compromising model accuracy. Needless to say, the counts won't be easily modelled with a NB distribution anymore, and you'll have the additional problem of what to do with negative values after subtraction. Sure, you could set a lower bound at zero, but that'll lead to a probability density at zero which isn't easily handled by standard NB models without zero inflation.

The second, slightly better approach is to fit a GLM with four groups - one for ChIP in cell type A; ChIP in cell type B; input in A, and input in B. You would then test for differential DB of ChIP vs. input between cell types. Any enrichment in the ChIP A group that also has a concomitant enrichment in the input A group would cancel out, thus avoiding detection of spurious DB between cell types that was driven only by some chromatin accessibility effect. In practice, though, this approach is not recommended because, in many cases, having more accessible chromatin is correlated with increased binding of various factors. If you get an increase in input A due to increased accessibility, it's entirely possible that you would also observe a genuine increase in binding of your target protein at this newly accessible site in ChIP A. Cancelling out this increase would reduce your ability to detect real changes in binding.

The final, and recommended, approach is simply to ignore these regions. This could be done retrospectively, by just ignoring them in your results during interpretation of the DB list; or proactively, by screening them out with blacklists (pre-defined or empirically identified with methods like GreyListChIP). The idea here is that it is not easy to remove the effect of increased chromatin accessibility/DNA copy number/whatever without screwing up the DB statistics. Thus, the safest approach is to remove the most problematic cases; and for a handful of regions in a genomics experiment, that isn't such a big loss (especially given the alternatives would affect the analysis of all regions). That said, it's usually a good idea to try to figure out why these regions are problematic - for example, are they rRNA genes, or something with loads of repeats? I find that microsatellites are often the biggest culprits for big blocks of signal.