How to select summit size for histone ChiP-Seq in Diffbind
2
3
Entering edit mode
urjaswita ▴ 40
@urjaswita-13128
Last seen 18 months ago

Hi folks,

I am new to Chip-Seq analysis hence this may be a basic question. I want to do differential binding analysis using DiffBind on H3K4me3 and HeK4Ac histone peaks called using macs2 in two different conditions. I am following the tutorial and in the count step below they use summits size

tamoxifen <- dba.count(tamoxifen, summits=250)

This example is for TF binding I think but, I am not sure what to use here for my data on histone ChiP-Seq. Could someone please suggest me what to use for summits size for histone ChiP-Seq for DiffBind?

Thanks so much!!

Diffbind Histone Chip-Seq Summit size • 1.6k views
5
Entering edit mode
Rory Stark ★ 4.0k
@rory-stark-5741
Last seen 1 day ago
CRUK, Cambridge, UK

For histone marks that have broad enrichment there is no magic number for this analysis. Are you using "broad" peaks (eg from MACS)?

Something to keep in mind is that the peak width doesn't need to encompass the entire enriched area to do a differential binding analysis. Using the summits parameter in this situation should result in a maximally enriched representative region being used for the comparison. So long as the same interval is used for all samples (which DiffBind will do automatically), the analysis should be reasonable, so it doesn't matter that much so long as the value is comfortably within the actual "peak" widths.

My practical advice here is to try the analysis both ways (with and without the summits parameter). To set the summits parameter, you may want to look at your distribution of peak widths and choose a value accordingly, e.g. the minimum or first quartile value.

For example in the sample dataset:

> data(tamoxifen_peaks)
> summary(tamoxifen$binding[,3]-tamoxifen$binding[,2])
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
310.0   706.0   830.0   961.4  1073.0  5251.0 

you can see that summits=250 results in intervals of 500bp, which is somewhere between the minimum and the first quartile.

-Rory

0
Entering edit mode

Thanks so much Rory!

0
Entering edit mode

Without the summits parameter, I usually find a very strong relationship between the FDR and width for peaks, especially for diffuse data. This indicates that the summits parameter sometimes must be set or risk artificial inflation of peak size/pvalue. However, it feels like setting summits may be forcing an assumption on the data which may not be correct. For example, in ATAC-Seq, it isn't really known how large an open region should be.

Do you have any suggestions for other ways to approach this problem? I am thinking it might be possible to combine an assumption-free approach with DiffBind and still get the benefits of both, but I am not sure exactly how.

0
Entering edit mode
Rory Stark ★ 4.0k
@rory-stark-5741
Last seen 1 day ago
CRUK, Cambridge, UK

Re-centering around summits isn't really meant to embed an assumption regarding the expected width of enriched areas. Rather, it is there to address issues relating to highly variable peak widths, especially after merging peak calls from multiple replicates. We are trying to lessen the proportion of the peaks that consist mostly of background reads, and hence increase technical variance. The more refined (narrow) the peak boundaries are, the fewer background bases are included, leading to higher confidence in assessing differential enrichment.

The idea is that we don't really need to know the "true" boundaries of enriched areas, but rather work with subsets of those regions that are likely to exhibit enrichment across replicates in at least one sample group, and test those higher-confidence areas for differential enrichment. Remember, when a tool like DiffBind calculates that a certain region is significantly differentially enriched, it is not saying that areas outside of these regions are not differentially enriched.

The important thing is to keep in mind is the scientific purpose of performing the analysis, in particular, what you are doing to do with the regions identified as exhibiting differential enrichment. In many cases, the next step is to annotate these regions and calculate their proximity to known genomic features (such as promoters or gene bodies), or to calculate their proximity to other differentially enriched features (such as histone marks indicating an active enhancer). Often these are then correlated with some other functional feature, such as transcript expression or chromatin looping. For these purposes, it is not necessary to know the precise boundaries of the enrichment, only that there is differential enrichment proximal to features of interest.

If more precision regarding the extent of enrichment (eg. open chromatin) is required to meet the objectives of the study, the re-centred regions identified as differential can be examined across replicates to more precisely determine the enrichment boundaries.

An approach that does not rely on peak calling, such as that used in csaw, makes fewer assumptions about the extent of enriched regions and can test small windows (even down to the base pair level) individually.