Search
Question: Modeling input data in ChIP-seq experiments with EdgeR, DESeq(2) or Limma
0
15 months ago by
European Union
k.vitting.seerup20 wrote:

I have a ChIP-seq dataset with two conditions. For each condition I have two replicates of both input and the acutal ChIP experiment: 8 samples in total.

In R the design matrix would look like this:

myDesing <- data.frame(sample=1:8, data=c(rep('Input',4), rep('Chip',4)), condition=c('Wt','Kd'))
myDesing$condition <- factor(myDesing$condition, levels=c('Wt','Kd'))
myDesing$data <- factor(myDesing$data, levels=c('Input','Chip'))

> model.matrix(data=myDesing, ~data + condition)
(Intercept) dataChip conditionKd
1           1        0           0
2           1        0           1
3           1        0           0
4           1        0           1
5           1        1           0
6           1        1           1
7           1        1           0
8           1        1           1

I'm interested in how to handle the input data in regards to doing a differential binding analysis with edgeR, DESeq(2) or Voom-Limma.

Is it possible to incorporating the input data into the linear model

My idea is to make a model that includes both the input and chip libraries and then test the interaction between chip and condition. More specifically for the R code example above this would correspond to testing the interaction dataChip::conditionKd. Does this make sense or does it violate some of the assumptions?

Does this not correspond to measuring the fold change of Input vs WT ( FCiwt ) and the fold change of Input vs KD (FCikd) and then comparing whether there is a difference between FCiwt and FCikd?

I have read this DESeq2 for ChIP-seq differential peaks which discuss whether one can subtract the input reads from before the DE analysis but it is a bit unsatisfactory as the discussion did not result in a good way to incorporate the input data.

Kristoffer

modified 15 months ago by Aaron Lun21k • written 15 months ago by k.vitting.seerup20
5
15 months ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

Consider:

FCiwt = WT - Input
FCikd = KD - Input
FCikd - FCiwt = KD - Input - (WT - Input) = KD - WT

So the inputs just end up cancelling out in the contrast - you might as well just compare WT and KD directly.

The more interesting question is what happens when you have WT- and KD-specific inputs, in which case you could theoretically compare the ChIP/input log-fold changes between WT and KD. In practice, this tends to be rather disappointing. Changes in input coverage are meant to account for changes in chromatin state, but for many protein targets, changes in binding often coincide with changes in chromatin state (e.g., as the DNA opens up). This means that dividing by input will cancel out the changes in binding in the ChIP samples. For example, if I did a perturbation and got a two-fold increase in the binding of my protein target at a location with a two-fold increase in accessibility (and thus input coverage), I would never be able to detect this as DB in a comparison of the ChIP/input fold changes. This is because the two fold increases mentioned above would just cancel out.

The approach I would recommend (repeated somewhere in the comments of the link in your post) is to use the inputs to identify problematic high-coverage regions, e.g., with the GreyListChIP package. These can be ignored - they're unlikely to be bound by the protein, if they already have high coverage in the inputs - which reduces the number of problematic regions in your DB analysis.

As I and others have mentioned, subtracting counts is a Bad Idea for count-based models.

Thanks! As i have condition specific input I will try it :-)

Do you think it is a problem for the normalization that the input libraries are 5-20x smaller than the acutal chip-seq libraries meaning 5-20x fewer counts in the peaks identified ?

In and of itself, differences in sequencing depth between samples are not a problem, as they are automatically modelled after normalization. However, in practice, there are (at least) two issues. Computationally, if the counts are too low, it may not be possible to compute sensible normalization factors. You may need to use fairly large regions of the genome - see the binning approach in the csaw package. The problem is exacerbated if you try to test for differences of ChIP/input log-fold changes; if both inputs are zero, you won't reject the null hypothesis, regardless of what happens to the ChIP samples.

From an experimental perspective, large differences in coverage may be symptomatic of other problems, e.g., failed library preparation. This is especially true for input controls, which should have plenty of DNA available; in contrast, low depth for IgG controls can be excused as a good pulldown protocol should result in a lot less DNA. Large differences in DNA concentration during library preparation may also result in other anomalies, e.g., excessive PCR duplicates in low-diversity libraries.