Re-centering around summits isn't really meant to embed an assumption regarding the expected width of enriched areas.
Rather, it is there to address issues relating to highly variable peak widths, especially after merging peak calls from multiple replicates.
We are trying to reduce the proportion of peaks that consist mostly of background reads and therefore increase technical variance.
The more refined (narrow) the peak boundaries are, the fewer background bases are included, leading to higher confidence in assessing differential enrichment.
The idea is that we don't really need to know the "true" boundaries of enriched areas, but rather work with subsets of those regions that are likely to exhibit enrichment across replicates in at least one sample group, and test those higher-confidence areas for differential enrichment.
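To make the re-centering idea concrete, here is a minimal sketch in Python. It is not DiffBind's code; the Peak type, the recenter function, and the toy coordinates are all hypothetical, and it simply assumes that each re-centered interval spans the summit plus and minus a fixed half-width (inclusive coordinates).

```python
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int   # inclusive
    end: int     # inclusive
    summit: int  # absolute position of maximum pileup (assumed known)


def recenter(peak: Peak, half_width: int = 250) -> Peak:
    """Return an interval of summit +/- half_width, clipped at position 0."""
    start = max(0, peak.summit - half_width)
    end = peak.summit + half_width
    return Peak(peak.chrom, start, end, peak.summit)


# Toy merged peaks with very different widths (made-up coordinates).
peaks = [
    Peak("chr1", 1000, 6000, 1400),  # broad merged peak, 5001 bp
    Peak("chr1", 9000, 9300, 9100),  # narrow peak, 301 bp
]

recentered = [recenter(p) for p in peaks]
widths = [p.end - p.start + 1 for p in recentered]
print(widths)
```

Both intervals end up the same width regardless of how wide the merged peak was, which is the point: the tested region is a higher-confidence subset around the summit, not a claim about the true extent of enrichment.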
Remember, when a tool like DiffBind calculates that a certain region is significantly differentially enriched, it is not saying that areas outside of these regions are not differentially enriched.
The important thing to keep in mind is the scientific purpose of performing the analysis, in particular, what you are going to do with the regions identified as exhibiting differential enrichment.
In many cases, the next step is to annotate these regions and calculate their proximity to known genomic features (such as promoters or gene bodies), or to calculate their proximity to other differentially enriched features (such as histone marks indicating an active enhancer).
Often these are then correlated with some other functional feature, such as transcript expression or chromatin looping.
For these purposes, it is not necessary to know the precise boundaries of the enrichment, only that there is differential enrichment proximal to features of interest.
If more precision regarding the extent of enrichment (e.g., open chromatin) is required to meet the objectives of the study, the re-centered regions identified as differential can be examined across replicates to more precisely determine the enrichment boundaries.
An approach that does not rely on peak calling, such as that used in csaw, makes fewer assumptions about the extent of enriched regions and can test small windows (even down to the base pair level) individually.
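As a rough illustration of the window-based idea, the sketch below counts reads in fixed-width windows tiled across a region, with no peak calls involved. This is not csaw's implementation or API; the window_counts function, its parameters, and the read positions are hypothetical, and each read is reduced to a single position for simplicity.

```python
def window_counts(read_positions, region_start, region_end, width=150, step=50):
    """Count reads whose position falls in each half-open window [s, s + width)."""
    counts = []
    for s in range(region_start, region_end - width + 1, step):
        e = s + width
        n = sum(1 for p in read_positions if s <= p < e)
        counts.append((s, e, n))
    return counts


# Toy read positions (made up for illustration).
reads = [105, 110, 160, 170, 175, 320]

# Non-overlapping 100 bp windows over positions 100..399.
windows = window_counts(reads, 100, 400, width=100, step=100)
for start, end, n in windows:
    print(start, end, n)
```

Each window's count would then be tested for differential enrichment on its own, so the extent of an enriched region emerges from which adjacent windows are significant instead of being fixed in advance.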
Thanks so much Rory!
Without the summits parameter, I usually find a very strong relationship between the FDR and width for peaks, especially for diffuse data. This indicates that the summits parameter sometimes must be set or risk artificial inflation of peak size/pvalue. However, it feels like setting summits may be forcing an assumption on the data which may not be correct. For example, in ATAC-Seq, it isn't really known how large an open region should be.
Do you have any suggestions for other ways to approach this problem? I am thinking it might be possible to combine an assumption-free approach with DiffBind and still get the benefits of both, but I am not sure exactly how.