Modeling input data in ChIP-seq experiments with EdgeR, DESeq(2) or Limma
Entering edit mode
Last seen 11 weeks ago
European Union

I have a ChIP-seq dataset with two conditions. For each condition I have two replicates of both input and the acutal ChIP experiment: 8 samples in total.

In R the design matrix would look like this:

    myDesing <- data.frame(sample=1:8, data=c(rep('Input',4), rep('Chip',4)), condition=c('Wt','Kd'))
    myDesing$condition <- factor(myDesing$condition, levels=c('Wt','Kd'))
    myDesing$data <- factor(myDesing$data, levels=c('Input','Chip'))

> model.matrix(data=myDesing, ~data + condition)
  (Intercept) dataChip conditionKd
1           1        0           0
2           1        0           1
3           1        0           0
4           1        0           1
5           1        1           0
6           1        1           1
7           1        1           0
8           1        1           1

I'm interested in how to handle the input data in regards to doing a differential binding analysis with edgeR, DESeq(2) or Voom-Limma. 

Is it possible to incorporating the input data into the linear model

My idea is to make a model that includes both the input and chip libraries and then test the interaction between chip and condition. More specifically for the R code example above this would correspond to testing the interaction dataChip::conditionKd. Does this make sense or does it violate some of the assumptions?

Does this not correspond to measuring the fold change of Input vs WT ( FCiwt ) and the fold change of Input vs KD (FCikd) and then comparing whether there is a difference between FCiwt and FCikd?

I have read this DESeq2 for ChIP-seq differential peaks which discuss whether one can subtract the input reads from before the DE analysis but it is a bit unsatisfactory as the discussion did not result in a good way to incorporate the input data.

In advance thanks for your help.

edgeR deseq2 chipseq limma • 1.1k views
Entering edit mode
Aaron Lun ★ 27k
Last seen 22 minutes ago
The city by the bay


FCiwt = WT - Input
FCikd = KD - Input
FCikd - FCiwt = KD - Input - (WT - Input) = KD - WT

So the inputs just end up cancelling out in the contrast - you might as well just compare WT and KD directly.

The more interesting question is what happens when you have WT- and KD-specific inputs, in which case you could theoretically compare the ChIP/input log-fold changes between WT and KD. In practice, this tends to be rather disappointing. Changes in input coverage are meant to account for changes in chromatin state, but for many protein targets, changes in binding often coincide with changes in chromatin state (e.g., as the DNA opens up). This means that dividing by input will cancel out the changes in binding in the ChIP samples. For example, if I did a perturbation and got a two-fold increase in the binding of my protein target at a location with a two-fold increase in accessibility (and thus input coverage), I would never be able to detect this as DB in a comparison of the ChIP/input fold changes. This is because the two fold increases mentioned above would just cancel out.

The approach I would recommend (repeated somewhere in the comments of the link in your post) is to use the inputs to identify problematic high-coverage regions, e.g., with the GreyListChIP package. These can be ignored - they're unlikely to be bound by the protein, if they already have high coverage in the inputs - which reduces the number of problematic regions in your DB analysis.

As I and others have mentioned, subtracting counts is a Bad Idea for count-based models.

Entering edit mode

Thanks! As i have condition specific input I will try it :-)

Entering edit mode

Do you think it is a problem for the normalization that the input libraries are 5-20x smaller than the acutal chip-seq libraries meaning 5-20x fewer counts in the peaks identified ?

Entering edit mode

In and of itself, differences in sequencing depth between samples are not a problem, as they are automatically modelled after normalization. However, in practice, there are (at least) two issues. Computationally, if the counts are too low, it may not be possible to compute sensible normalization factors. You may need to use fairly large regions of the genome - see the binning approach in the csaw package. The problem is exacerbated if you try to test for differences of ChIP/input log-fold changes; if both inputs are zero, you won't reject the null hypothesis, regardless of what happens to the ChIP samples.

From an experimental perspective, large differences in coverage may be symptomatic of other problems, e.g., failed library preparation. This is especially true for input controls, which should have plenty of DNA available; in contrast, low depth for IgG controls can be excused as a good pulldown protocol should result in a lot less DNA. Large differences in DNA concentration during library preparation may also result in other anomalies, e.g., excessive PCR duplicates in low-diversity libraries. 


Login before adding your answer.

Traffic: 233 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6