Question: DiffBind - What exactly is a "consensus" peak set?
wesley.cai wrote, 22 months ago:

Hi, I have a question about the definition of a consensus peak set. I can think of two options:

A) One single peak set made by combining multiple peak sets and merging overlapping peaks, akin to: cat 1.bed 2.bed | sort -k1,1 -k2,2n | bedtools merge -i stdin > consensus.bed

B) A "pooled" peak set consisting of all the peaks from the two input peak sets, akin to: cat 1.bed 2.bed > consensus.bed

Which one is it?

I ask because I am trying to decide whether it would be a good idea to perform IDR on my samples and use the single IDR-cutoff peak set as the "consensus" for differential binding analysis. I would do this if A) is the right answer. However, if it is B), then IDR is somewhat redundant, as DiffBind would just tell me which of the peaks are significant.

Thanks in advance.

-Wes

Gord Brown (United Kingdom) wrote, 22 months ago:

Hi,

It's A). Peaks are merged if they overlap by >= 1 bp. We've talked about an option to keep all peaks, i.e. not merge them, but that raises questions about how to count reads that overlap multiple peaks, etc.
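To make the difference between the two options concrete, here is a minimal Python sketch of the merge rule described above. It is an illustration only, not DiffBind's actual implementation: the function name `merge_peaks`, the toy peak lists, and the choice of BED-style half-open coordinates are all assumptions made for this example.

```python
from collections import defaultdict

def merge_peaks(peaks):
    """Merge (chrom, start, end) intervals that overlap by at least 1 bp.

    Assumption: coordinates are BED-style half-open, so two intervals that
    merely touch (end == next start) share no base and are kept separate.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current = None
        for start, end in sorted(by_chrom[chrom]):
            if current is None:
                current = [start, end]
            elif start < current[1]:          # shares at least 1 bp
                current[1] = max(current[1], end)
            else:
                merged.append((chrom, current[0], current[1]))
                current = [start, end]
        if current is not None:
            merged.append((chrom, current[0], current[1]))
    return merged

# Hypothetical peak calls from two samples.
sample_1 = [("chr1", 100, 200), ("chr1", 500, 600)]
sample_2 = [("chr1", 150, 250), ("chr1", 600, 700)]

pooled = sample_1 + sample_2        # option B: every peak kept, overlaps remain
consensus = merge_peaks(pooled)     # option A: overlapping peaks collapsed
print(consensus)
```

In this toy input the first pair of peaks overlaps and collapses into one wider interval, while the second pair only touches and stays as two intervals. In the merged set no base is covered by more than one peak, so a read never has to be split between peaks.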

Unsolicited advice: IDR is quite unstable, i.e. different initial parameter choices yield quite different answers. The advice from Li (the author of IDR) is to try a variety of parameters to see if there is a common core of 'good' peaks.

Cheers,

 - Gord


I see, thanks Gord. I also welcome any advice from you brilliant folks. That's the first I've heard of that, so I'll give it a go on my data. 

This is outside the scope of DiffBind, but do you personally prefer obtaining significant differences through DiffBind without the IDR pipeline? Or do you actually follow Li's advice?

written 22 months ago by wesley.cai

I wanted to add a few things to what Gord said.

The ENCODE standards, including IDR, were developed as part of an effort to map where binding sites and epigenetic marks are located. The focus of this type of "mapping" exercise is on identifying the location of binding sites with high confidence.

The goals of a differential analysis are different. We are trying to identify genomic intervals where we have confidence that binding levels have changed. A definitive "map" of high-confidence binding sites is not required to accomplish this. The techniques used in DiffBind should be robust to the inclusion of low-confidence binding sites and noise, so long as there are sufficient replicates to properly power the analysis. Only sites that consistently differ in read density across all the replicates in the sample groups should be identified as being differentially bound with high confidence (low FDR). So choosing a "lenient" consensus set, and not worrying too much about getting a perfect set, is fine.

Second, regarding the merging of peaks that overlap in multiple samples: we do this so that the consensus peaks are unique in the bases they cover, which lets us assign reads uniquely when counting. There are some downsides to this. One is that the peak intervals tend to get wider the more samples there are, and wider peaks can include more background, which can compromise the analysis. For "punctate" peaks such as transcription factor binding sites, we recommend re-centering the peaks using the summits parameter in dba.count(). This will identify a consensus "summit" (point of highest coverage) and replace the peak interval with a new one of consistent width centered on the summit. For example, if you specify summits=200, the peak intervals will all be about 400bp wide (200bp upstream and downstream of the summit).
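The re-centering idea can be sketched in a few lines of Python. This is a toy illustration of the concept, not what dba.count() does internally: the helpers `find_summit` and `recenter`, the made-up coverage values, and the half-open coordinate convention are assumptions for this example. Whether the summit base itself adds one to the width depends on the coordinate convention in use.

```python
def find_summit(peak_start, coverage):
    """Return the position of highest coverage within a peak.

    `coverage` is a hypothetical per-base read depth list starting at
    `peak_start`; ties go to the leftmost position.
    """
    best_offset = max(range(len(coverage)), key=coverage.__getitem__)
    return peak_start + best_offset

def recenter(summit, flank):
    """Replace a peak with a fixed-width interval around its summit.

    Returns a half-open (start, end) interval extending `flank` bp on
    each side, clipped so the start never goes below 0.
    """
    return max(0, summit - flank), summit + flank

# Toy merged peak starting at position 1000 with made-up coverage.
coverage = [1, 2, 5, 9, 4, 1]
summit = find_summit(1000, coverage)
start, end = recenter(summit, flank=200)
print(summit, start, end, end - start)
```

However wide the merged peak had grown, the re-centered interval has the same width for every peak, which keeps the amount of background included per site comparable.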

Another disadvantage of merging (and re-centering) is that the consensus peaks can be difficult to relate back to the originally called peaks. The idea is that DiffBind helps identify regions of the genome where we have high confidence that binding has changed; a more detailed analysis of what is going on in these regions (which may involve a complex pattern of enrichment) can then follow.

-Rory

written 22 months ago by Rory Stark

This makes a lot of sense, thank you for the insight!

-Wes

written 22 months ago by wesley.cai

Hi,

We (our Bioinformatics Core group) do our best to insist on at least 3 replicates, so IDR, which compares pairs of replicates, is not directly applicable. Personally I don't use IDR; I just use DiffBind and the underlying package DESeq2, which does the actual statistical analysis. I don't think anybody in our group uses IDR routinely.

Cheers,

 - Gord

written 22 months ago by Gord Brown

Ah, I see. Thanks for the response! 

-Wes

written 22 months ago by wesley.cai