4.1 years ago by
Cambridge, United Kingdom
Our advice is: don't do input subtraction. This will break the mean-variance modelling in
edgeR. For example, let's say you have two windows; one with large IP and input counts (I'll call this window X), and another with small IP and input counts (Y). Under any realistic model like the NB model used in
edgeR, the variance of the counts for window X should be higher than that of Y, simply because the counts are larger and have more scope to vary.
Now, assume that subtraction is performed such that the average of the "IP - input" counts is the same between both windows. This means that we end up with a window that has a large variance (i.e., X), at the same mean as a window with a smaller variance (Y).
edgeR cannot model this properly during empirical Bayes shrinkage, as the mean-dispersion trend will only have a single fitted value for the dispersion at any given mean. Should that fitted value be optimized for modelling the variance of X, such that the variance of Y will not be modelled properly; or Y, such that if will fail for X; or try to go for some compromise, in which case neither window will be modelled properly? In fact, the trend fitting itself will be confounded if all of the windows are crammed into a smaller range of mean values after subtraction.
And that's not even considering problems with negative counts. This is mostly problematic if your input and IP signals are of comparable size, such that there is a decent chance that the former is larger than the latter. You could apply a floor of zero to the post-subtraction counts, but then you might end up with substantial zero inflation that affects the accuracy of the NB model. I also note that you've done a scaling approach to adjust for library size prior to subtraction. Doing this properly is not trivial, e.g., classic
edgeR uses a sophisticated quantile-matching strategy in
q2qnbinom to preserve the mean-variance relationship upon library size adjustment.
Check out some discussions at A: DESeq2 for ChIP-seq differential peaks for an equivalent problem in ChIP-seq data. Finally, as an aside, it seems like you're doing a window-based analysis of this data. I suspect that you may already know about
csaw, based on our off-list conversations; it might be worth testing it out on this data set.
modified 4.1 years ago
4.1 years ago by
Aaron Lun • 25k