Question

MedIP-Seq: IP over Input with edgeR

1

Entering edit mode

Arne Muller ▴ 20

@arne-muller-8308

Last seen 7.9 years ago

United States

Hello,

I'd like to run edgeR on a MeDIP-Seq experiment with 4 treatment groups (each with 5 samples). I'm counting reads in slides over the genome (a few hundred bases with some overlap).

I realised that there is a small batch effect as both IP and input separates per treatment group in an MDS plot (limma::plotMDS) of the count matrix. I can't simply add a batch factor to my model as this is confounded with my treatment factor. However, per sample I have IP and input, and it might be that I can remove the batch effect when I remove the input counts from the IP, but I'd first need to scale each sample for library size as the input samples have only be sequenced with half the coverage of the IP samples.

Should I subtract the normalised input from my IP per window when the downstream analysis is done with edgeR or would this violate the statistical assumptions or interfere with edgeR's internal normalisation of the data?

thanks for help

Arne

edge • 1.5k views

ADD COMMENT • link updated 8.5 years ago by Aaron Lun ★ 28k • written 8.5 years ago by Arne Muller ▴ 20

1

Entering edit mode

You'll want to fix the tag, otherwise the edgeR maintainers won't see this. Fortunately for you, I'm here all the time.

ADD REPLY • link 8.5 years ago Aaron Lun ★ 28k

score 1 · Answer 1 · 2015-10-22

Our advice is: don't do input subtraction. This will break the mean-variance modelling in edgeR. For example, let's say you have two windows; one with large IP and input counts (I'll call this window X), and another with small IP and input counts (Y). Under any realistic model like the NB model used in edgeR, the variance of the counts for window X should be higher than that of Y, simply because the counts are larger and have more scope to vary.

Now, assume that subtraction is performed such that the average of the "IP - input" counts is the same between both windows. This means that we end up with a window that has a large variance (i.e., X), at the same mean as a window with a smaller variance (Y). edgeR cannot model this properly during empirical Bayes shrinkage, as the mean-dispersion trend will only have a single fitted value for the dispersion at any given mean. Should that fitted value be optimized for modelling the variance of X, such that the variance of Y will not be modelled properly; or Y, such that if will fail for X; or try to go for some compromise, in which case neither window will be modelled properly? In fact, the trend fitting itself will be confounded if all of the windows are crammed into a smaller range of mean values after subtraction.

And that's not even considering problems with negative counts. This is mostly problematic if your input and IP signals are of comparable size, such that there is a decent chance that the former is larger than the latter. You could apply a floor of zero to the post-subtraction counts, but then you might end up with substantial zero inflation that affects the accuracy of the NB model. I also note that you've done a scaling approach to adjust for library size prior to subtraction. Doing this properly is not trivial, e.g., classic edgeR uses a sophisticated quantile-matching strategy in q2qnbinom to preserve the mean-variance relationship upon library size adjustment.

Check out some discussions at A: DESeq2 for ChIP-seq differential peaks for an equivalent problem in ChIP-seq data. Finally, as an aside, it seems like you're doing a window-based analysis of this data. I suspect that you may already know about csaw, based on our off-list conversations; it might be worth testing it out on this data set.