Question: DESeq2 for ChIP-seq differential peaks
tangming2005 (United States) wrote, 2.2 years ago:

Hi there,

I want to use DESeq2 for differential peak detection for ChIP-seq data.

I know there are many methods implemented for this purpose.

DiffBind internally uses DESeq2 and edgeR, but I want to go the other way:

Say I have an untreated and a treated group for my ChIP-seq data, each with three replicates.

For ChIP-seq, we usually include an input control, which is just genomic DNA without pull-down by a specific antibody.

I call peaks with MACS2 for all of them, and merge all the peaks into a super-set.

Now I can count, from the BAM files for each replicate and input, how many reads fall in each peak.

A final count table will look like this:

                untreat_1  untreat_2  untreat_3  untreat_input_control  treat_1  treat_2  treat_3  treat_input_control
    lib size    10M        10M        10M        20M                    12M      12M      12M      24M
    peak_1      20         25         30         3                      10       12       13       5
    peak_2      10         28         13         8                      64       43       45       7


How should I use this raw count table with DESeq2?  

Should I subtract the input_control raw counts from the IP counts (presumably after normalizing to library size first)?

But then I would get decimals, which violates DESeq2's requirement for raw integer counts.

Thanks !

Ming Tang




modified 4 months ago by k.vitting.seerup • written 2.2 years ago by tangming2005

To add to the other answers, you should definitely read the csaw paper before continuing, especially the section about strategies for combining peak calls across samples. The paper shows that taking the union of per-sample peak calls is not a good strategy and will result in loss of accurate FDR control. This finding applies regardless of whether or not you actually use the csaw package for your analysis.

modified 23 months ago • written 23 months ago by Ryan C. Thompson

I'm not sure that's what they are saying. My understanding is that the FDR issue relates not to merging peak calls from different samples (prior to a differential analysis), but to merging peaks that have already been identified as differentially bound. The confidence statistics associated with each differentially bound region cannot be easily combined. This is actually more of an issue, requiring a specific solution, in an approach like csaw, where windows are tested for differential counts and then clustered/merged into larger differentially bound regions.

The issue in the paper seems to boil down to the expansion of peak width that comes with taking unions. The wider the peaks, the more background reads they include, which raises the FDR. In current versions of DiffBind, a recenter-around-summits strategy is used to control the width of the consensus peaks, which takes care of this problem.

Potentially, the more potent concern with a peak-based approach is that the same information used to identify the peak regions is re-used to identify differences in binding affinity within those regions. I'm not sure I agree that this is a problem in practice (or, if it is, it is not fully sidestepped by a windowing approach where many windows are filtered out), but it is a more applicable observation.

modified 23 months ago • written 23 months ago by Rory Stark

Your last paragraph is exactly the issue I was referring to. I'm referring specifically to tables 1 and 2 in the csaw paper, and the accompanying text sections "Description of the peak-based strategy", "Many peak-based implementations are possible", and especially "Peaks should be called from pooled libraries". The question above describes a peak calling method that corresponds to method 1 in the csaw paper, which results in highly over-conservative differential binding calls.

written 23 months ago by Ryan C. Thompson

Thought I might chip in here (no pun intended), as that paper sounds somewhat familiar. In short, Ryan's interpretation is mostly correct. Using what I call the "at least 2" strategy can be conservative (as shown in the 2014 NAR paper), but it can also be liberal (as shown in the 2015 NAR paper). To explain this behaviour, consider the peaks that are only just detected, i.e., called in exactly two libraries. If those two libraries belong to the same group, you enrich for spurious DB, which results in liberalness. If they belong to different groups, you enrich for peaks with large dispersions, which results in conservativeness. Whether it does one or the other can't be easily predicted in a given data set. This makes the "at least 2" approach a bit unpalatable, as its statistical properties are not well defined.

Of course, this really only matters for the peaks near the limit of detection. Strong peaks that are far away from the detection limit should be fine no matter what you do (with caveats, due to information sharing across peaks in edgeR). That said, as method developers, we earn our money by squeezing blood out of rocks - uh, I mean, getting more information out of the data, including that for weak peaks near the detection limit. If you want reliable statistical inferences for DB in these peaks, then the "at least 2" approach will have some shortcomings.

modified 23 months ago • written 23 months ago by Aaron Lun

Thanks, I am following Dr. Gordon K. Smyth's tutorial

modified 23 months ago • written 23 months ago by tangming2005
Michael Love (United States) wrote, 2.2 years ago:


You should definitely never subtract or divide the counts you provide to DESeq2.

I would not use the input counts at all for differential binding. I would just compare treated counts vs untreated ChIP counts.

I would also recommend taking a look at the DiffBind and csaw vignettes and workflows, at the least to understand the best practices they've set out.
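As a sketch of this suggestion, the raw ChIP counts (inputs dropped) can go straight into DESeq2, which handles the library-size differences itself via size factors. The numbers below are the hypothetical ones from the question's table; the DESeq2 calls are shown as comments since this is only an outline:

```r
# Raw counts from the question's example table, ChIP samples only.
counts <- matrix(
  c(20, 25, 30, 10, 12, 13,
    10, 28, 13, 64, 43, 45),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("peak_1", "peak_2"),
                  c("untreat_1", "untreat_2", "untreat_3",
                    "treat_1", "treat_2", "treat_3")))

# Sample metadata: one condition factor, untreated as the reference level.
coldata <- data.frame(
  condition = factor(rep(c("untreat", "treat"), each = 3),
                     levels = c("untreat", "treat")),
  row.names = colnames(counts))

# The DESeq2 calls themselves (commented so the sketch stays self-contained):
# library(DESeq2)
# dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
#                               design = ~ condition)
# dds <- DESeq(dds)
# res <- results(dds)   # per-peak log2 fold change, treat vs untreat
```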


written 2.2 years ago by Michael Love

From what I can see, there are two choices: either we get erroneous DB calls because of differences in chromatin state and input coverage, or we get errors due to distorted modelling of the mean-variance relationship after input subtraction. Our (i.e., the edgeR development team's) opinion is that the latter is more dangerous for routine analyses of ChIP-seq count data. Inaccurate variance modelling will affect the inferences for every genomic region via EB shrinkage, while spurious calls due to differences in chromatin state should be limited to a smaller number of regions. Rory's blacklist idea might be useful here, as we would be able to screen out such problematic regions before they cause any headaches during interpretation.

modified 2.2 years ago • written 2.2 years ago by Aaron Lun

While it is certainly the case that altering the read counts using control reads violates an essential assumption underlying DESeq2 (namely the use of unadulterated read counts), ignoring the control tracks can also lead to incorrect results. This is because the binding matrix may include read counts for enriched regions (peaks) in samples where they were not originally identified as enriched compared to the control. As DESeq2 will have no way of detecting an anomaly in the Input control for that sample in that region, the results may be misleading. This is most likely to occur in experiments involving different cell types.

There are alternative ways to incorporate the Input controls. For example, instead of scaling the control read counts, the control libraries can be down-sampled to the same depth as each corresponding ChIP sample prior to peak calling and counting. This is what we do in our processing pipelines. This still involves altering the ChIP read counts via subtraction, however, and in practice down-sampling and scaling almost always yield the same results.

The other method is the more aggressive use of blacklists. Generating blacklists based on every Input control, and removing reads/peaks from every sample that overlap any blacklisted area, can eliminate false positives in those regions where there is an anomaly in an Input control.  Gordon Brown developed the GreyListChIP package for this purpose.

modified 2.2 years ago • written 2.2 years ago by Rory Stark

Thanks all for your insight! I now have a much better understanding of the assumptions behind DESeq2 and edgeR.


modified 2.2 years ago • written 2.2 years ago by tangming2005
Rory Stark (CRUK, Cambridge, UK) wrote, 2.2 years ago:

I'm not sure what you mean by wanting to "take the other way". Do you mean: why not use the DiffBind package, which does exactly this type of analysis? Is the idea that you want to control each step yourself?

If you want to subtract the control reads from the ChIP reads, you should do a simple scaling first. Rounding to the nearest integer avoids the DESeq2 issue (you also have to check for negative values, as it is possible that there are more reads in the control than in the ChIP for some merged consensus peaks in some samples). DiffBind does this by default. In your case, it would multiply the control read counts by the ratio of ChIP:Control reads (computed individually for each sample), 0.50 in all of your examples, and round the result before subtracting.
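As a concrete illustration of this arithmetic, here is a minimal base-R sketch using the hypothetical numbers from the question's table (untreat_1 vs its input); this is only an illustration, not DiffBind's actual code:

```r
# Counts for one ChIP sample (10M-read library) and its Input control
# (20M-read library), taken from the question's example table.
chip_counts    <- c(peak_1 = 20, peak_2 = 10)
control_counts <- c(peak_1 = 3,  peak_2 = 8)

ratio <- 10e6 / 20e6                 # ChIP:Control library-size ratio = 0.5
scaled_control <- round(control_counts * ratio)      # scale, then round
adjusted <- pmax(chip_counts - scaled_control, 0)    # subtract, clip negatives
adjusted
```

The result stays an integer count table with no negative values, so it no longer trips DESeq2's input checks.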

In an experiment like the one you describe, where the same tissue type is used for all the samples, subtracting the control reads shouldn't make much difference to the results, as the same control is used for all the samples in each group, unless there are significant differences between the Inputs for the two sample groups. That should only happen if the treatment had a big impact on open chromatin (or as the result of a technical issue in the ChIP). The Input controls are most important for the MACS peak-calling step.


written 2.2 years ago by Rory Stark

Thanks Rory for your comment. The reason is that I have a lot of BAM files (hundreds..). Using DiffBind/R might be too memory-demanding. So I want to prepare the count table as for RNA-seq first and then feed it into DESeq2.

As you suggested, I will scale to library size first, subtract the scaled input reads from the IP reads (rounding before subtracting and taking care of negative numbers), then feed the table to DESeq2.

Thanks again!



written 2.2 years ago by tangming2005

Hi Rory,

I want to know whether the counts below are raw counts or counts rounded after scaling to library size, for both IP and Input.

In other words, are `Reads` (from the IP) reported before or after scaling to library size?

The same question for `cReads`.




        Chr   Start     End Score      RPKM Reads     cRPKM cReads
    1  chr1  752472  759042   123 0.9810335   199 0.5289105     76
    2  chr1  761042  762921    39 0.8101533    47 0.1946692      8
    3  chr1  845566  847753    47 1.2143984    82 0.7317340     35
    4  chr1  850918  861812   181 1.1000451   370 0.7932470    189
    5  chr1  870871  872043    22 1.2159653    44 0.8582802     22
    6  chr1  946879  951106   161 1.7163739   224 0.6814630     63
    7  chr1  953846  963471   190 1.1273018   335 0.6888129    145
    8  chr1  966517  978024    88 0.7205664   256 0.6675460    168
    9  chr1  993603 1000013    77 0.8690936   172 0.6776408     95
    10 chr1 1002011 1005956    21 0.8128012    99 0.9040274     78



written 23 months ago by tangming2005

Hi Rory, could you please confirm whether the Reads and cReads are library-size scaled and rounded? See my question above.


written 23 months ago by tangming2005

No, these are true raw counts (ChIP and Control, respectively) that are not scaled or normalized in any way.

RPKM and cRPKM are normalized to library depth and peak width.

If you want the TMM-normalized read counts, use dba.count() with score set to one of the score IDs that start with DBA_SCORE_TMM_. The normalized values will then be present in the Score column of the peaks. You can change the score without re-counting reads by setting peaks=NULL, for example:

> H3K27ac_TMMreadsFull <- dba.count(H3K27ac_InputSubtract, peaks=NULL, score=DBA_SCORE_TMM_READS_FULL)
> H3K27ac_TMMreadsFull$peaks[[1]]$Score[1:10]
> H3K27ac_TMMsubtractEffective <- dba.count(H3K27ac_InputSubtract, peaks=NULL, score=DBA_SCORE_TMM_MINUS_EFFECTIVE)
> H3K27ac_TMMsubtractEffective$peaks[[1]]$Score[1:10]

Rather than access $peaks directly, you can get all the scores in one go using dba.peakset():

> H3K27ac_TMMreadsFullScores <- dba.peakset(H3K27ac_TMMreadsFull, bRetrieve=TRUE)



modified 23 months ago • written 23 months ago by Rory Stark

Thanks Rory for the clarification.


written 23 months ago by tangming2005

How can I get the TMM values for the Input control samples? The reason is that I want to get the TMM values for all the IP and Input samples and plot a PCA to see whether any of my IP and Input samples are mislabeled. (It looks like that is the case for one of my samples.)



written 23 months ago by tangming2005

I've had to do this before as well. One way is to run the ChIPQC package and get a report, as that specifically looks at enrichment in Input controls compared to the corresponding ChIPs.

The other way is to create a new samplesheet that has extra rows for each of your Input controls. You can supply a consensus peakset for these (or all) samples, then read them in with dba(), run dba.count(), and then do the desired plots.


written 23 months ago by Rory Stark
Ryan C. Thompson (The Scripps Research Institute, La Jolla, CA) wrote, 2.2 years ago:

You might want to take a look at the csaw package, which adapts the edgeR method with all the necessary modifications for unbiased ChIP-seq differential binding analysis. Also, I would not recommend subtracting input from ChIP counts, since all the count-based methods assume that you are analyzing absolute counts, not "counts in excess of background".

written 2.2 years ago by Ryan C. Thompson

Interesting; I am waiting for @mikelove's comments.

written 2.2 years ago by tangming2005
k.vitting.seerup (European Union) wrote, 4 months ago:

Is there a good reason why nobody has suggested incorporating the input data into the linear model, @mikelove @aaronlun?

My idea is to make a model that includes both the input and ChIP libraries and then test the interaction between ChIP status and condition.

More specifically, for the following R code example (using knockdown (Kd) as the treatment):

    myDesign <- data.frame(sample=1:8, data=c(rep('Input',4), rep('Chip',4)), condition=c('Wt','Kd'))
    myDesign$condition <- factor(myDesign$condition, levels=c('Wt','Kd'))
    myDesign$data <- factor(myDesign$data, levels=c('Input','Chip'))

> model.matrix(data=myDesign, ~data + condition)
  (Intercept) dataChip conditionKd
1           1        0           0
2           1        0           1
3           1        0           0
4           1        0           1
5           1        1           0
6           1        1           1
7           1        1           0
8           1        1           1

Could you then just test the interaction dataChip:conditionKd with edgeR or DESeq2 (after adding the data:condition interaction term to the formula, i.e. ~data * condition)?

Does this not correspond to measuring the ChIP vs Input fold change in Wt (FCwt) and the ChIP vs Input fold change in Kd (FCkd), and then testing whether FCwt and FCkd differ?
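For reference, the matrix above (from ~data + condition) contains no interaction column, so testing dataChip:conditionKd requires the interaction term in the formula. A base-R sketch with the same hypothetical eight-sample layout:

```r
# Same design as above, with the full factorial formula ~ data * condition.
myDesign <- data.frame(
  sample    = 1:8,
  data      = factor(c(rep("Input", 4), rep("Chip", 4)),
                     levels = c("Input", "Chip")),
  condition = factor(rep(c("Wt", "Kd"), 4), levels = c("Wt", "Kd")))

mm <- model.matrix(~ data * condition, data = myDesign)
colnames(mm)
# "(Intercept)" "dataChip" "conditionKd" "dataChip:conditionKd"

# The interaction column is 1 only for ChIP samples in the Kd condition:
mm[, "dataChip:conditionKd"]
```

The coefficient named "dataChip:conditionKd" is the one to test in edgeR or DESeq2.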

written 4 months ago by k.vitting.seerup

This should really belong as a separate question (which I will answer, of course).

modified 4 months ago • written 4 months ago by Aaron Lun

I have now raised this as a separate question Modeling input data in ChIP-seq experiments with EdgeR, DESeq(2) or Limma.

written 4 months ago by k.vitting.seerup
Powered by Biostar version 2.2.0