Question

How to select summit size for histone ChiP-Seq in Diffbind

3

Entering edit mode

urjaswita ▴ 40

@urjaswita-13128

Last seen 6.0 years ago

Hi folks,

I am new to Chip-Seq analysis hence this may be a basic question. I want to do differential binding analysis using DiffBind on H3K4me3 and HeK4Ac histone peaks called using macs2 in two different conditions. I am following the tutorial and in the count step below they use summits size

tamoxifen <- dba.count(tamoxifen, summits=250)

This example is for TF binding I think but, I am not sure what to use here for my data on histone ChiP-Seq. Could someone please suggest me what to use for summits size for histone ChiP-Seq for DiffBind?

Thanks so much!!

Diffbind Histone Chip-Seq Summit size • 6.8k views

ADD COMMENT • link updated 3.7 years ago by Rory Stark ★ 5.2k • written 8.4 years ago by urjaswita ▴ 40

score 8 · Answer 1 · 2017-09-21

For histone marks that have broad enrichment there is no magic number for this analysis. Are you using "broad" peaks (eg from MACS)?

Something to keep in mind is that the peak width doesn't need to encompass the entire enriched area to do a differential binding analysis. Using the summits parameter in this situation should result in a maximally enriched representative region being used for the comparison. So long as the same interval is used for all samples (which DiffBind will do automatically), the analysis should be reasonable, so it doesn't matter that much so long as the value is comfortably within the actual "peak" widths.

My practical advice here is to try the analysis both ways (with and without the summits parameter). To set the summits parameter, you may want to look at your distribution of peak widths and choose a value accordingly, e.g. the minimum or first quartile value.

For example in the sample dataset:

> data(tamoxifen_peaks)
> summary(tamoxifen$binding[,3]-tamoxifen$binding[,2])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  310.0   706.0   830.0   961.4  1073.0  5251.0

you can see that summits=250 results in intervals of 500bp, which is somewhere between the minimum and the first quartile.

-Rory

score 1 · Answer 2 · 2021-06-15

Re-centering around summits isn't really meant to embed an assumption regarding the expected width of enriched areas. Rather, it is there to address issues relating to highly variable peak widths, especially after merging peak calls from multiple replicates. We are trying to lessen the proportion of the peaks that consist mostly of background reads, and hence increase technical variance. The more refined (narrow) the peak boundaries are, the fewer background bases are included, leading to higher confidence in assessing differential enrichment.

The idea is that we don't really need to know the "true" boundaries of enriched areas, but rather work with subsets of those regions that are likely to exhibit enrichment across replicates in at least one sample group, and test those higher-confidence areas for differential enrichment. Remember, when a tool like DiffBind calculates that a certain region is significantly differentially enriched, it is not saying that areas outside of these regions are not differentially enriched.

The important thing is to keep in mind is the scientific purpose of performing the analysis, in particular, what you are doing to do with the regions identified as exhibiting differential enrichment. In many cases, the next step is to annotate these regions and calculate their proximity to known genomic features (such as promoters or gene bodies), or to calculate their proximity to other differentially enriched features (such as histone marks indicating an active enhancer). Often these are then correlated with some other functional feature, such as transcript expression or chromatin looping. For these purposes, it is not necessary to know the precise boundaries of the enrichment, only that there is differential enrichment proximal to features of interest.

If more precision regarding the extent of enrichment (eg. open chromatin) is required to meet the objectives of the study, the re-centred regions identified as differential can be examined across replicates to more precisely determine the enrichment boundaries.

An approach that does not rely on peak calling, such as that used in csaw, makes fewer assumptions about the extent of enriched regions and can test small windows (even down to the base pair level) individually.