Search
Question: complex design for chip-seq analysis
0
12 months ago by
Assa Yeroslaviz1.4k
Munich, Germany
Assa Yeroslaviz1.4k wrote:

I have a chip-seq data set with three conditions (wt , KO1, KO2) and three time points (0h, 2h, 4h). For each condition and TP I have duplicated samples for IP and Input samples, which looks like that:

 condition TP Replicate IP Input wt 0h 1 IP Input wt 0h 2 IP Input wt 2h 1 IP Input wt 2h 2 IP Input wt 4h 1 IP Input wt 4h 2 IP Input KO1 0h 1 IP Input KO1 0h 2 IP Input KO1 2h 1 IP Input KO1 2h 2 IP Input KO1 4h 1 IP Input KO1 4h 2 IP Input KO2 0h 1 IP Input KO2 0h 2 IP Input KO2 2h 1 IP Input KO2 2h 2 IP Input KO2 4h 1 IP Input KO2 4h 2 IP Input

I would like not only to do a pairwise comparison of IP vs. Input, but also to be able to do a more complex comparisons. I would like for example to wee if there are any differential binding events based on the differences over time (something like a time-series analysis) within each condition (e.g. comparing wt.0h - wt.2h - wt.4h).

But to make it even more complicated, I would also like to make a comparison of the different time-series against each other. What I mean by that is for example to compare differential binding events from the wt time-series to KO1 or to KO2. In this case I would like to see if there are peaks (= overlapping genes) in the wt time-series which are significantly different than the peaks (or regions). in the KO sample time-series.

I know this is not so easy, as peaks in one conditions/time-point don't necessarily mean also peaks in a different samples. But these kind of behavior is exactly what i would like to catch. If there is a significant peak in the wt sample, but no peak in the KO I would like to see it. This would a normal peak calling, but when I have multiple crossing samples, I'm not sure how to do the analysis.

thanks, Assa

modified 12 months ago by Aaron Lun21k • written 12 months ago by Assa Yeroslaviz1.4k
1
12 months ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

Much of this seems to be doable with judicious use of makeContrasts for the contrast= argument in glmQLFTest (assuming you're using edgeR). If you want to determine whether time has any effect:

con <- makeContrasts(wt.2h - wt.0h, wt.4h - wt.0h, levels=design)


... which asks whether there's any differences between the time points. This is the closest you will get to a time series analysis within each genotype. There's not much point fitting a spline or doing anything more complicated, given that you only have three time points to work with.

If you want to find differential time responses between genotypes:

con <- makeContrasts((wt.2h - wt.0h) - (ko.2h - ko.0h),
(wt.4h - wt.0h) - (ko.4h - ko.0h), levels=design)


... which asks whether the WT time effect is that same as the KO time effect. Here, we are looking for genes where the log-fold change between time points is different between genotypes. This test doesn't care about differences in the baseline level of binding (i.e., at time 0) between genotypes, it will only reject if the effect between time points is different between genotypes.

Alternatively, you could do something like:

con <- makeContrasts(wt.0h - ko.0h, wt.2h - ko.2h, wt.4h - ko.4h, levels=design)


... which asks whether there are any differences between WT and KO binding at any time point. Unlike the previous contrast, this will reject the null if there is a difference in the baseline and/or in the magnitude of the time effects between genotypes.

As for your other question: if you're using csaw, you'll be using sliding windows to define genomic regions. Here, there is no concept of a window being "significant" in one sample and not in the others. The most that we do is to filter windows based on their abundance, and this is done using the average count across all samples. This strategy is deliberately blind to the exact differences between conditions, which ensures that we never make choices about which windows to keep or discard based on the presence/absence of binding in different conditions. It means that we don't have to worry about the differences between conditions when defining the genomic regions of interest; this decision is left (correctly) to the hypothesis testing machinery.

Finally, I should add that, depending on the contrast, you may need to normalize differently. If you are comparing ChIP samples to input, you should be normalizing for composition biases. If you are comparing the ChIP samples to each other, you will have to choose between normalization for composition and efficiency biases, depending on the biological system. Check out the user's guide for more details.

I just want to add to Aaron's very thorough answer that in my experience if you have used a peak caller which takes background (input) into account when calling peaks you can generally ignore input in the differential analysis (as it makes almost no difference). If you only quantify known annotated features (TSS or similar) you do however need to take input into account.

Furthermore to get high quality results I would suggest to follow the encode ChIP guidelines and use IDR in the peak calling.

Thanks for the answer and the reference. But I am not sure, what you mean by that.

How can I ignore the Input file, after running a peak caller. I did use MACS2 to call for peaks. But how do I get the counts per Gene/Transcript/TSS after running a peak caller?

I know I can get the counts from a bam file, either by external tools, e.g. featureCounts or by internal R packages such as Qusar

How do you get the counts from the output of MACS2?

thanks

In my experience you can ignore the input if you are doing differential analysis on peaks called with tools such as MACS2 because peaks are defined as regions where there is much more signal than background - and in my experience including background makes no or very little difference.

This is however not the case if you quantify the ChIP-seq signal in already predefined regions - such as those given by annotation. These quantification you woul not get from a peak caller but directly from the quantification of mapped reads.

There is a long history of people using peak-callers to define features, counting reads across those features, and then using those counts in packages such as edgeR. This is a very intuitive approach; however, it is fraught with statistical pitfalls, as you are defining the features with the same data that are used for differential testing. Unless done carefully, this constitutes "data dredging" whereby the process of feature definition biases the hypothesis testing in edgeR. This results in unpredictable effects including inflation of the dispersion estimates and/or loss of type I error control.

Our 2014 paper (https://dx.doi.org/10.1093/nar/gku351) describes some examples of how common peak calling strategies can interfere with the differential analysis in edgeR. This mostly discusses the simplest case where the samples are of the same sequencing depth. I can only imagine that the problems would be exacerbated when dealing with samples of different depth, which will inherently alter the detection power for peaks between samples.

Long story short, if you don't want to read the paper: call peaks on the pooled library of all samples (assuming similar library sizes), if the peaks are to be defined independently of the DB status in a negative binomial model such as edgeR. You can also call peaks on the pooled IP vs pooled input, and then use the resulting peaks to quantify coverage in the IP samples for a differential analysis. This is a relatively sensible use of the input libraries in the DB framework.

Thanks Aaron for the really elaborate answer. So if I understand it correctly, it is ok to use edgeR (and for that case also DESeq) with ChIP-Seq data? I thought it was meant only for RNA-Seq, as the reads are Poisson distributed and not negative binomial. Am I wrong with this assumption?

Another difficulty, which also regards the comment from k.vitting.seerup, is how to apply this contrast to the fact that for each time point I have two "conditions" - The IP and the INPUT duplicates.  Would it be ok to do something like that when creating the contrast matrix(e.g. the first example):

con <- makeContrasts(wt.2h - wt.0h, wt.4h - wt.0h, levels=design)



One can do this

con <- makeContrasts( (wt2h.IP - wt2h.INPUT) - (wt.0h.IP - wt.0h.INPUT), (wt4h.IP - wt4h.INPUT) - (wt.0h.IP - wt.0h.INPUT), levels=design)

Of course, I'll need to modify the meta data table to account for this options:

condition TP Replicate IP_Input term
wt 0h 1 Input wt.0h.INPUT
wt 0h 1 IP wt.0h.IP
wt 0h 2 Input wt.0h.INPUT
wt 0h 2 IP wt.0h.IP
wt 2h 1 Input wt.2h.INPUT
wt 2h 1 IP wt.2h.IP
wt 2h 2 Input wt.2h.INPUT
wt 2h 2 IP wt.2h.IP
wt 4h 1 Input wt.4h.INPUT
wt 4h 1 IP wt.4h.IP
wt 4h 2 Input wt.4h.INPUT
wt 4h 2 IP wt.4h.IP
...
Would this be the correct way to go?
1

Yes, it is okay to use edgeR for ChIP-seq counts. I should also clear up some misconceptions that you seem to have. For starters, I can tell you that ChIP-seq counts are not Poisson-distributed between biological replicates. Technical replicates, perhaps, but there is definitely a non-zero estimate for the dispersion when you model the variability between biological replicates. Secondly, even if the counts were Poisson-distributed, this would not invalidate the NB model used by edgeR, as the Poisson distribution is simply a special case of the negative binomial distribution with a dispersion of zero.

As for the input libraries - I would never use them in a contrast testing for DB between ChIP-seq libraries. Consider:

(wt2h.IP - wt2h.INPUT) - (wt.0h.IP - wt.0h.INPUT)

... which tests whether there is a difference between the IP/input log-fold change between time points 0 and 2. On the surface, this sounds sensible enough, especially when you might expect changes to chromatin state (and accessibility, and therefore input coverage) between time points. However, we often find that this approach results in the loss of many regions that would be detected as DB by a direct comparison between IP libraries. Why? Well, changes in chromatin state are often correlated to changes in binding for your protein of interest, given that the chromatin needs to be nice and open for the protein to bind to the genome in the first place. By performing subtraction of coefficients on the log-scale, any log-fold change in binding in the IP libraries will be cancelled out by a correlated log-fold change in accessibility in the input libraries.

If you are concerned about differences in input coverage biasing your DB results, I would suggest using the GreyListChIP package. Check out a related discussion here: A: DESeq2 for ChIP-seq differential peaks.