There are three possible approaches to incorporating input data into this kind of analysis, though only one that I'd actually recommend.
Let's start with the obvious-but-inadvisable approach: subtracting input counts from ChIP counts for each window. This is the most intuitive strategy, as you remove the contribution of background enrichment to isolate the effect of specific binding by your protein of interest. However, the downstream analysis with edgeR (or DESeq - see discussion at A: DESeq2 for ChIP-seq differential peaks) will no longer be valid. This is because the mean-variance relationship will be totally messed up - the subtraction will effectively shuffle points along the x-axis of the BCV plot, breaking the nice decreasing trend and compromising the accuracy of the fitted model. Needless to say, the counts won't be easily modelled with an NB distribution anymore, and you'll have the additional problem of what to do with negative values after subtraction. Sure, you could set a lower bound at zero, but that will produce a point mass at zero, which isn't easily handled by standard NB models without zero inflation.
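To see why subtraction misbehaves, here's a quick simulation sketch (plain numpy, with made-up NB parameters - the exact numbers don't matter, only the qualitative behaviour): when ChIP and input have similar background levels, a large fraction of windows go negative after subtraction, and clamping them piles up a spike at zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000

# Simulate NB-distributed counts for ChIP and input across n windows.
# Parameters are invented for illustration; both have mean ~5.
chip = rng.negative_binomial(n=5, p=0.5, size=n)
inp = rng.negative_binomial(n=5, p=0.5, size=n)

diff = chip - inp
neg_frac = np.mean(diff < 0)          # windows that go negative

clamped = np.maximum(diff, 0)         # lower bound at zero...
zero_frac = np.mean(clamped == 0)     # ...creates a point mass there

print(f"fraction negative after subtraction: {neg_frac:.2f}")
print(f"point mass at zero after clamping:   {zero_frac:.2f}")
```

Nearly half of all windows end up stacked at zero here - a distribution that no standard NB model (without zero inflation) will fit sensibly.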
The second, slightly better approach is to fit a GLM with four groups - ChIP in cell type A, ChIP in cell type B, input in A, and input in B. You would then test whether the ChIP-over-input enrichment differs between cell types. Any enrichment in the ChIP A group that has a concomitant enrichment in the input A group would cancel out, thus avoiding detection of spurious DB between cell types that was driven only by some chromatin accessibility effect. In practice, though, this approach is not recommended because, in many cases, more accessible chromatin is correlated with increased binding of various factors. If you see an increase in input A due to increased accessibility, it's entirely possible that you would also observe a genuine increase in binding of your target protein at this newly accessible site in ChIP A. Cancelling out this increase would reduce your ability to detect real changes in binding.
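For concreteness, the contrast arithmetic behind that four-group test can be sketched as follows. This is plain numpy rather than actual edgeR code, and the group order and coefficient values are invented for illustration - the point is just that a shared accessibility effect appears in both the ChIP A and input A coefficients and cancels in the interaction contrast.

```python
import numpy as np

# One-way layout: one coefficient per group
# (analogous to model.matrix(~0 + group) in R).
groups = ["chip.A", "chip.B", "input.A", "input.B"]
design = np.eye(4)

# Interaction contrast: (ChIP.A - input.A) - (ChIP.B - input.B),
# i.e. does ChIP-over-input enrichment differ between cell types?
contrast = np.array([1.0, -1.0, -1.0, 1.0])

# Toy fitted log2 coefficients: an accessibility bump of +1 in cell
# type A shows up in both chip.A and input.A, so it cancels.
beta = np.array([2.0, 1.0, 1.5, 0.5])
print(contrast @ beta)  # -> 0.0: the shared accessibility effect cancels
```

The cancellation is exactly the double-edged sword described above: it removes accessibility-driven artifacts, but it would just as happily remove a genuine binding increase that rides on that accessibility change.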
The final, and recommended, approach is simply to ignore these regions. This could be done retrospectively, by just ignoring them in your results during interpretation of the DB list; or proactively, by screening them out with blacklists (pre-defined or empirically identified with methods like GreyListChIP). The idea here is that it is not easy to remove the effect of increased chromatin accessibility/DNA copy number/whatever without screwing up the DB statistics. Thus, the safest approach is to remove the most problematic cases; and for a handful of regions in a genomics experiment, that isn't such a big loss (especially given the alternatives would affect the analysis of all regions). That said, it's usually a good idea to try to figure out why these regions are problematic - for example, are they rRNA genes, or something with loads of repeats? I find that microsatellites are often the biggest culprits for big blocks of signal.
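The proactive screening step amounts to an interval-overlap filter: drop any analysis window that overlaps a blacklisted region. A minimal sketch is below - in a real analysis you'd do this with GenomicRanges/csaw machinery in R rather than a naive loop, and the coordinates here are made up.

```python
def overlaps(a, b):
    """True if two half-open (start, end) intervals on the same
    chromosome overlap."""
    return a[0] < b[1] and b[0] < a[1]

def filter_windows(windows, blacklist):
    """Keep only windows that overlap no blacklisted interval."""
    return [w for w in windows if not any(overlaps(w, b) for b in blacklist)]

windows = [(0, 100), (150, 250), (300, 400)]
blacklist = [(180, 220)]  # e.g. a microsatellite-rich block of signal

kept = filter_windows(windows, blacklist)
print(kept)  # -> [(0, 100), (300, 400)]
```

Only the handful of problematic windows are lost, while the statistics for every other region stay untouched - which is precisely the argument for this approach over the two alternatives.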
modified 2.5 years ago
2.5 years ago by
Aaron Lun • 21k