DiffBind - What exactly is a "consensus" peak set?
wesley.cai • 0
@wesleycai-11862

Hi, I have a question about the definition of a consensus peak set. I can think of two options:

A) A single peak set formed by concatenating and merging multiple peak sets, akin to: cat 1.bed 2.bed | sort -k1,1 -k2,2n | bedtools merge -i stdin > consensus.bed

B) A "pooled" peak set consisting of all the peaks from the two input peak sets, unmerged, akin to: cat 1.bed 2.bed > consensus.bed

Which one is it?

I ask because I am trying to decide whether to run IDR on my samples and use the single IDR-thresholded peak set as the "consensus" for differential binding analysis. I would do this if A) is the right answer. If it is B), however, IDR is somewhat redundant, as DiffBind would simply tell me which of the pooled peaks are significant.

-Wes

Gord Brown ▴ 650
@gord-brown-5664
United Kingdom

Hi,

It's A).  Peaks are merged if they overlap by >= 1 bp.  We've talked about an option to keep all peaks, i.e. not merge them, but that raises questions such as how to count reads that overlap multiple peaks.
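As an illustration of the merging rule only (this is not DiffBind's code), here is a minimal Python sketch. It assumes BED-style coordinates (0-based start, exclusive end), so "overlap by >= 1 bp" becomes a strict `start < previous end` test; the function name `merge_peaks` and the sample intervals are made up for the example.

```python
def merge_peaks(peaks):
    """Merge (chrom, start, end) intervals that share at least 1 bp.

    Assumption: BED-style coordinates (0-based start, exclusive end).
    """
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            # Shares >= 1 bp with the previous interval: extend that interval.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            # New chromosome, or no shared base: start a new consensus peak.
            merged.append((chrom, start, end))
    return merged


# Hypothetical peak calls from two samples.
sample_1 = [("chr1", 100, 200), ("chr1", 500, 600)]
sample_2 = [("chr1", 150, 250), ("chr1", 600, 700), ("chr2", 100, 200)]

pooled = sample_1 + sample_2        # option B: every peak kept as called
consensus = merge_peaks(pooled)     # option A: overlapping peaks merged

for peak in consensus:
    print(*peak, sep="\t")
```

In this toy input the five pooled peaks collapse to four: chr1:100-200 and chr1:150-250 become one interval, while chr1:500-600 and chr1:600-700 stay separate because they touch without sharing a base. Note that `bedtools merge` with default settings would also join such book-ended intervals, so the two approaches can differ at the edges.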

Unsolicited advice: the IDR is quite unstable, i.e. different initial parameter choices yield quite different answers.  Li's (the author) advice is to try a variety of parameters to see if there is a common core of 'good' peaks.

Cheers,

- Gord


I see, thanks Gord. I also welcome any advice from you brilliant folks. That's the first I've heard of IDR being unstable, so I'll try a range of parameters on my data.

This is outside the scope of DiffBind, but do you personally prefer obtaining significant differences through DiffBind without the IDR pipeline? Or do you actually follow Li's advice?


I wanted to add some things to what Gord said.

The ENCODE standards, including IDR, were developed as part of an effort to map where binding sites and epigenetic marks are located. The focus of this type of "mapping" exercise is on identifying the location of binding sites with high confidence.

The goals of a differential analysis are different. We are trying to identify genomic intervals where we have confidence that binding levels have changed. A definitive "map" of high-confidence binding sites is not required to accomplish this. The techniques used in DiffBind should be robust to the inclusion of low-confidence binding sites and noise, so long as there are sufficient replicates to properly power the analysis. Only sites that consistently differ in read density across all the replicates in the sample groups should be identified as being differentially bound with high confidence (low FDR). So choosing a "lenient" consensus set, and not worrying too much about getting a perfect set, is fine.

Secondly, regarding the merging of peaks that overlap in multiple samples: we do this so that the consensus peaks are unique in the bases they cover, which lets us assign reads uniquely when counting. There are some downsides to this. One is that the peak intervals tend to get wider the more samples there are, and wider peaks can include more background, which can compromise the analysis. For "punctate" peaks such as transcription factor binding sites, we recommend re-centering the peaks using the summits parameter in dba.count(). This will identify a consensus "summit" (point of highest coverage) and replace the peak interval with a new one of consistent width centered on the summit. For example, if you specify summits=200, the peak intervals will all be about 400bp wide (200bp upstream and 200bp downstream of the summit).
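To make the re-centering idea concrete, here is a small Python sketch. It illustrates the concept only and is not what dba.count() actually runs: the function name `recenter_on_summit` and the coverage vector are hypothetical, the summit is taken as the first base with the highest coverage in a single vector, and the interval is treated as half-open, which gives a width of exactly 2 x summits. How DiffBind derives a consensus summit across samples, and whether the summit base itself is counted in the final width, should be checked against the DiffBind documentation.

```python
def recenter_on_summit(start, coverage, summits):
    """Return a fixed-width (start, end) interval centred on the summit.

    start    -- genomic start of the merged peak (0-based)
    coverage -- per-base read depth across the merged peak
    summits  -- number of bp to keep on each side of the summit
    """
    # Offset of the first base with the highest coverage.
    offset = max(range(len(coverage)), key=coverage.__getitem__)
    summit = start + offset
    # Clamp at 0 so the interval never runs off the start of the chromosome.
    return max(summit - summits, 0), summit + summits


# Hypothetical 800 bp merged peak starting at position 1000,
# with a sharp summit 300 bp in.
coverage = [1] * 300 + [9] + [1] * 499

new_start, new_end = recenter_on_summit(1000, coverage, summits=200)
print(new_start, new_end, new_end - new_start)
```

The wide merged interval (800 bp here) is replaced by a narrower one of fixed width around the summit, which is the point of the recommendation for punctate peaks: every consensus peak ends up the same width, and less flanking background is counted.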

Another disadvantage of merging (and recentering) is that the consensus peaks can be difficult to relate back to the originally called peaks. The idea is that DiffBind helps identify regions on the genome where we have high confidence that the binding changed; there can then be more detailed analysis of what is going on in these regions (which may involve a complex pattern of enrichment).

-Rory


This makes a lot of sense, thank you for the insight!

-Wes


Hi,

We (our Bioinformatics Core group) do our best to insist on at least 3 replicates, so IDR, which is designed around pairs of replicates, is not directly applicable.  Personally I don't use IDR; I just use DiffBind and the underlying package DESeq2, which does the actual statistical analysis.  I don't think anybody in our group uses IDR routinely.

Cheers,

- Gord


Ah, I see. Thanks for the response!

-Wes