Question

Issue carrying out DiffBind Analysis at a predefined peak set

doherta6 • 0

Hi, I am trying to look at differential binding of a number of chromatin regulatory proteins via CUT&RUN, restricted to a predefined set of peaks (approx. 5,000 peaks).

When I carry out the analysis for different proteins at the same predefined peaks I get a different number of consensus peaks which is always less than 5000. I assumed the analysis would be carried out with the entire predefined peak-set as the consensus peaks and the number would stay the same. However, this does not seem to be the case. If you could provide some clarity on what is happening that would be much appreciated.

I am supplying my predefined peaks as a BED file at the sample sheet generation stage, under the peaks column for each sample (3 replicates and 2 conditions).

Thank you in advance,

Anthony

DiffBind • 739 views

updated 18 months ago by Rory Stark ★ 5.1k • written 18 months ago by doherta6 • 0

score 0 · Answer 1 · 2022-09-29

Assuming you are using default parameter values, there are three aspects of the processing that may alter the number of peaks:

1. If the peakset you are passing in includes intervals that overlap (by at least one basepair), these will be merged into a single wider peak. This is most likely not what is happening in your case, as you are using the same peakset for each of the comparisons but ending up with a different number of intervals.

2. In the dba.count() phase, when the peaks are re-centered around the summit, it is possible that peaks that didn’t overlap initially overlap after counting reads and are merged. For example, if the primary point of enrichment for a protein is located at the upstream edge of one peak and the downstream edge of an adjacent peak, they may overlap after extending the window according to the value of the summits parameter. This shouldn’t happen very often and is probably not what is driving the difference in peak numbers you are seeing. You can look for this effect by setting summits=FALSE.

3. Also in the dba.count() phase, a filter is applied by default to remove peak intervals with very low enrichment across all the samples. This is most likely the culprit in your case, if there are some proteins with enrichment in fewer of the pre-defined peak regions. You can test this by setting filter=0 to eliminate the filtering.
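
The three checks above could be run along the following lines. This is only a rough sketch, not a tested workflow: the file names predefined_peaks.bed and samples.csv are placeholders for your own files, and reading the BED file with rtracklayer and comparing against GenomicRanges::reduce() is just one assumed way of spotting overlapping intervals outside of DiffBind.

```r
library(DiffBind)
library(GenomicRanges)
library(rtracklayer)

# Placeholder file names; substitute your own BED file and sample sheet.
predefined <- import("predefined_peaks.bed")
db <- dba(sampleSheet = "samples.csv")

# Check 1: do any of the predefined intervals overlap each other?
# min.gapwidth = 0L merges only overlapping ranges, not merely adjacent ones.
n_input  <- length(predefined)
n_merged <- length(reduce(predefined, min.gapwidth = 0L, ignore.strand = TRUE))
cat("input intervals:", n_input, " after merging overlaps:", n_merged, "\n")

# Default counting: summit re-centering and the low-count filter are both active.
db_default <- dba.count(db)

# Check 2: turn off summit re-centering so peaks keep their original
# boundaries and cannot be merged after being re-centered.
db_nosummits <- dba.count(db, summits = FALSE)

# Check 3: additionally turn off the low-enrichment filter.
db_nofilter <- dba.count(db, summits = FALSE, filter = 0)

# Compare the number of consensus intervals retained in each case.
n_peaks <- function(x) length(dba.peakset(x, bRetrieve = TRUE))
cat("default:      ", n_peaks(db_default), "\n")
cat("summits=FALSE:", n_peaks(db_nosummits), "\n")
cat("+ filter=0:   ", n_peaks(db_nofilter), "\n")
```

If the last count matches the number of intervals left after merging overlaps, the default filter (and possibly summit re-centering) would account for the peaks that went missing.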