Hi there,
I want to use DESeq2 for differential peak detection for ChIP-seq data.
I know there are many methods implemented for this purpose https://github.com/crazyhottommy/ChIP-seq-analysis#differential-peak-detection
DiffBind internally uses DESeq2 and edgeR, but I want to go a different route:
Say I have an untreated and a treated group for my ChIP-seq data, each with three replicates.
For ChIP-seq, we usually also include an input control, which is just genomic DNA without pulldown with a specific antibody.
I call peaks with MACS2 for all of them, and merge all the peaks into a super-set. https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part1_peak_calling.md#merge-peaks
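To make the merge step concrete, here is a minimal Python sketch of taking the union of per-sample peak calls (the peak tuples are hypothetical; in practice a tool like bedtools merge or GenomicRanges::reduce does this):

```python
# Minimal sketch of merging peak calls from several samples into a
# union superset. Peaks are (chrom, start, end) tuples; coordinates
# here are made up for illustration.

def merge_peaks(peak_lists):
    """Union-merge overlapping intervals across all samples."""
    all_peaks = sorted(p for peaks in peak_lists for p in peaks)
    merged = []
    for chrom, start, end in all_peaks:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged

sample_a = [("chr1", 100, 200), ("chr1", 500, 600)]
sample_b = [("chr1", 150, 250), ("chr2", 10, 90)]
print(merge_peaks([sample_a, sample_b]))
# [('chr1', 100, 250), ('chr1', 500, 600), ('chr2', 10, 90)]
```

Note how the merged peak chr1:100-250 is wider than any single sample's call; this widening is exactly what the csaw discussion below is concerned about.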
Now, I can count from the bam files for each replicate and input, how many reads in each peak. https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part2_Preparing-ChIP-seq-count-table.md
A final count table will look like this:
lib size:   10M        10M        10M        20M                    12M      12M      12M      24M
            untreat_1  untreat_2  untreat_3  untreat_input_control  treat_1  treat_2  treat_3  treat_input_control
peak_1      20         25         30         3                      10       12       13       5
peak_2      10         28         13         8                      64       43       45       7
........
How should I use this raw count table with DESeq2?
My first instinct is to subtract the input_control raw counts from the IP counts (though I would need to normalize to library size first?),
but then I will get decimals, which violates DESeq2's assumption of raw integer counts https://www.biostars.org/p/143458/#157303
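To illustrate the problem with a quick sketch (numbers loosely follow the example table above and are hypothetical):

```python
# Why "scale input to the IP library size, then subtract" breaks
# DESeq2's raw-integer-count assumption.

LIB_SIZE_IP = 10_000_000      # untreat_1 library
LIB_SIZE_INPUT = 20_000_000   # untreat_input_control library

ip_counts = [20, 10]          # peak_1, peak_2 in untreat_1
input_counts = [3, 8]         # peak_1, peak_2 in untreat_input_control

scale = LIB_SIZE_IP / LIB_SIZE_INPUT   # bring input down to the IP's depth
corrected = [ip - inp * scale for ip, inp in zip(ip_counts, input_counts)]
print(corrected)  # [18.5, 6.0] -- fractional values, no longer raw counts
```

Even when a value happens to come out whole (6.0), the matrix as a whole is no longer a matrix of raw counts, which is what DESeq2's statistical model expects.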
Thanks !
Ming Tang
To add to the other answers, you should definitely read the csaw paper before continuing, especially the section about strategies for combining peak calls across samples. The paper shows that taking the union of per-sample peak calls is not a good strategy and results in loss of accurate FDR control. This finding applies regardless of whether or not you actually use the csaw package for your analysis.
I'm not sure that's what they are saying. My understanding is that the FDR issue relates not to merging peak calls from different samples (prior to a differential analysis), but to merging peaks that have already been identified as differentially bound. The confidence statistics associated with each differentially bound region cannot easily be combined. This is actually more of an issue, requiring a specific solution, in an approach like csaw, where windows are tested for differential counts and then clustered/merged into larger differentially bound regions.
The issue in the paper seems to boil down to the expanding peak width that comes with taking unions. The wider the peaks, the more background reads they include, which inflates the FDR. In current versions of DiffBind, a recenter-around-summits strategy is used to control the width of the consensus peaks, which takes care of this problem.
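The recenter-around-summits idea can be sketched as follows (the width and summit coordinates are hypothetical, not DiffBind's actual defaults):

```python
# Instead of taking the ever-widening union of peaks, keep a
# fixed-width window around each consensus summit, so background
# reads at the peak edges are excluded from the counts.

WIDTH = 400  # fixed consensus-peak width, i.e. summit +/- 200 bp

def recenter(summits, width=WIDTH):
    """Return fixed-width intervals centred on each summit."""
    half = width // 2
    return [(chrom, max(0, pos - half), pos + half) for chrom, pos in summits]

summits = [("chr1", 1500), ("chr1", 8200)]
print(recenter(summits))
# [('chr1', 1300, 1700), ('chr1', 8000, 8400)]
```

Every consensus peak then has the same width regardless of how many samples contributed calls, so the union no longer drags in extra background.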
Potentially, the more potent concern in a peak-based approach is that the same information used to identify the peak regions is re-used to identify differences in binding affinity within those regions. I'm not sure I agree that this is a problem in practice (or, if it is, whether it is fully sidestepped by a windowing approach in which many windows are filtered out), but it is a more applicable observation.
Your last paragraph is exactly the issue I was referring to. I'm referring specifically to tables 1 and 2 in the csaw paper, and the accompanying text sections "Description of the peak-based strategy", "Many peak-based implementations are possible", and especially "Peaks should be called from pooled libraries". The question above describes a peak-calling method that corresponds to method 1 in the csaw paper, which results in highly over-conservative differential binding calls.
Thought I might chip in here (no pun intended), as that paper sounds somewhat familiar. In short, Ryan's interpretation is mostly correct. Using what I call the "at least 2" strategy can be conservative (as shown in the 2014 NAR paper) but it can also be liberal (as shown in the 2015 NAR paper). To explain this behaviour, consider the peaks that are only just detected, i.e., called in exactly two libraries. If those two libraries belong to the same group, you enrich for spurious DB, which results in liberalness. If they belong to different groups, you enrich for peaks with large dispersions, which results in conservativeness. Whether it does one or the other can't be easily predicted in a given data set. This makes the "at least 2" approach a bit unpalatable, as its statistical properties are not well defined.
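For readers unfamiliar with the rule being discussed, a minimal sketch of the "at least 2" consensus filter (library and peak names are hypothetical):

```python
# Keep a peak only if it was called in >= 2 libraries. For the
# borderline peaks called in *exactly* two libraries, which groups
# those libraries belong to determines whether the rule behaves
# liberally or conservatively, as explained above.

from collections import Counter

calls = {                       # library -> set of called peak IDs
    "untreat_1": {"p1", "p2"},
    "untreat_2": {"p1"},
    "treat_1":   {"p1", "p3"},
    "treat_2":   {"p3", "p2"},
}
MIN_LIBS = 2

counts = Counter(p for peaks in calls.values() for p in peaks)
consensus = sorted(p for p, n in counts.items() if n >= MIN_LIBS)
print(consensus)  # ['p1', 'p2', 'p3']

# p3 survives via two libraries in the *same* group (treat_1, treat_2):
#   enriches for spurious DB -> liberal behaviour.
# p2 survives via one library in *each* group (untreat_1, treat_2):
#   enriches for high-dispersion peaks -> conservative behaviour.
```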
Of course, this really only matters for the peaks near the limit of detection. Strong peaks that are far away from the detection limit should be fine no matter what you do (with caveats, due to information sharing across peaks in edgeR). That said, as method developers, we earn our money by squeezing blood out of rocks - uh, I mean, getting more information out of the data, including that for weak peaks near the detection limit. If you want reliable statistical inferences for DB in these peaks, then the "at least 2" approach will have some shortcomings.

Thanks, I am following Dr. Gordon K. Smyth's tutorial http://f1000research.com/articles/4-1080/v1