I am learning the DiffBind package, and I have some questions not covered by the vignette.
1) Is the peak score relevant to affinity analysis and, if so, in what way?
2) In which cases (generally) should one use an occupancy analysis rather than an affinity analysis?
3) The idea of greylists is to exclude regions that show a disproportionate amount of signal in the control. Shouldn't using the control sample at the peak-calling stage be sufficient? If there is a very strong signal in the control, the peak caller should not call a peak in that region.
4) What are the supplied control files used for (in affinity analysis)? I understand that they will be used if one performs background normalization; is there another use for them?
5) What are the assumptions made when setting bFullLibrarySize=TRUE?
As far as I understand, setting this parameter to FALSE assumes that the majority of peaks do not change between conditions. That can be assumed if the difference between conditions is known to change only a small number of peaks.
But does setting bFullLibrarySize=TRUE make specific assumptions of its own? Perhaps you could give an example of an experiment where those assumptions would be violated.
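To make the two assumptions concrete, here is a rough sketch in Python (all read counts here are made up for illustration; this is the general idea behind the two scaling approaches, not DiffBind's actual code):

```python
import numpy as np

# Hypothetical two-sample experiment: total sequenced reads per sample
# ("full library size") and reads falling in consensus peaks ("RiP").
full_lib = np.array([20_000_000, 40_000_000])  # total reads per sample
rip      = np.array([ 2_000_000,  8_000_000])  # reads in peaks per sample

# bFullLibrarySize=TRUE analogue: scale each sample by its total library size.
# Assumption: overall sequencing depth is the only systematic difference
# between samples.
full_factors = full_lib / full_lib.mean()  # roughly [0.67, 1.33]

# bFullLibrarySize=FALSE analogue: scale by reads in the binding matrix.
# Assumption: most peaks do NOT change, so total in-peak signal should be
# comparable across samples after normalization.
rip_factors = rip / rip.mean()             # [0.4, 1.6]

print(full_factors)
print(rip_factors)
```

If sample 2 genuinely has more overall binding (for example, a treatment that globally increases occupancy), scaling by reads-in-peaks would normalize that real biological difference away, while full-library-size scaling would preserve it.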
The bFullLibrarySize parameter itself has been removed; its functionality is now integrated into the new function dba.normalize(). The Normalization section of the vignette discusses the impact of different library sizes on normalization. In most cases, we recommend normalizing against a set of reference reads other than those in the binding matrix (reads overlapping consensus peaks), such as full library sizes, background bins, or spike-in reads.
Some more questions:
As you have explained here, using the entire region of enrichment includes more bases that are not truly enriched, leading to more noise. More noise should lead to fewer positive results in a good experimental system.
(In my specific case, with broad peaks, running DiffBind with summits=500 yields 330 peaks; running it with summits=FALSE yields 360 peaks, with 300 peaks shared between the two runs.)
In DiffBind v3, the default score parameter in dba.count() is DBA_SCORE_NORMALIZED. Does that mean the same normalization is used by default for dba.count() and dba.analyze()?
It is not necessarily the case that there will be fewer peaks when summits=FALSE. What is happening in your case is that you are specifying quite broad peaks (1001bp-wide intervals), so they are more likely to overlap each other and get merged into a single interval. When they get merged, they become even wider, so they can include even more "background" noise.
In general, we are not aiming to "get more precise boundaries of enrichment" in the sense of accurately delineating where enrichment starts and stops. Rather, we are aiming to settle on intervals with representative enrichment. In this case, we prefer narrower peaks taken from the middle of an enriched region to "represent" that region, rather than over-wide regions that include background noise. To obtain this, we re-center around summits and trim.
In general, when we think about the goals of our biological experiment, we rarely need the precise boundaries of enriched regions. Rather, we need confidence in the intervals identified as demonstrating differential enrichment. Typically these are mapped to biological features -- enhancers, promoters, etc. -- and then those features are used for downstream analysis.
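The merging effect described above can be sketched with a toy interval-merge in Python (the summit positions are hypothetical, and this is an illustration of the idea, not DiffBind's actual merging code):

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into maximal runs."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it (making it wider).
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical summit positions; re-centering to +/-500bp gives 1001bp peaks.
summits = [1000, 1800, 5000]
wide = [(s - 500, s + 500) for s in summits]

# The first two 1001bp intervals overlap, so three peaks merge into two,
# and the merged one spans 1800bp -- wider than either original interval.
print(merge_intervals(wide))
```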
The score used in dba.count() is only used for certain plotting and reporting functions, and does not impact the analysis. The analysis depends on how the data are normalized.
Early on, when edgeR and the original DESeq were developed, some work was done showing that the negative binomial is appropriate for modelling most sequencing count data, including specifically ChIP data (cf. McCarthy, Chen, and Smyth, 2012), but I'm not aware of subsequent systematic analysis in this area.
-- Rory Stark, Gord Brown
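To illustrate why a negative binomial (rather than, say, a Poisson) is used for such counts, here is a small simulation (the mean and dispersion values are made up; the mean-variance relation Var = mu + alpha*mu^2 is the standard NB2 form that edgeR and DESeq assume):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate read counts from a negative binomial with mean mu and
# dispersion alpha, so Var = mu + alpha * mu^2.
mu, alpha = 100.0, 0.2
n = 1 / alpha            # NB "size" parameter
p = n / (n + mu)         # success-probability parameterization

counts = rng.negative_binomial(n, p, size=100_000)

# A Poisson with the same mean would have variance ~100; the simulated
# counts are far more variable ("overdispersed"), as real replicate
# sequencing counts tend to be.
print(counts.mean())     # close to 100
print(counts.var())      # close to 100 + 0.2 * 100**2 = 2100
```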
On a different subject:
After reading in all the peaks with the dba() function, it is possible to create a correlation heatmap immediately (even before counting reads in consensus peaks). What are the per-sample values between which the correlation is calculated? The vignette says "Figure 1: Correlation heatmap, using occupancy (peak caller score) data". But the number of peaks, and hence the number of peak caller scores, differs between samples. AFAIK, to calculate a correlation you need vectors of the same length. Is it possible to view the matrix on which the correlation is calculated?
Generally speaking, why is the default FDR cutoff for the DESeq2 analysis in DiffBind 0.05, rather than DESeq2's "native" default of 0.1?
I have found an answer by Rory Stark to the first question here
The parameter should be between the minimum and Q1 (of the lengths of the consensus peaks).
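As a sketch of how one might compute that range for a given consensus peakset (the peak widths below are made up; in practice you would take the widths of your actual consensus peaks):

```python
import numpy as np

# Hypothetical consensus peak widths (bp) from a broad-peak experiment.
widths = np.array([400, 600, 800, 1000, 1500, 2000, 3000, 5000, 8000, 12000])

lo = widths.min()                # minimum peak width
q1 = np.percentile(widths, 25)   # first quartile of the widths

# Per the advice quoted above, the recommended range for the parameter
# lies between these two values.
print(lo, q1)  # 400 850.0
```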
The answer to the second question was found here
When the peaks are first read in (prior to counting), a default binding matrix is created by merging peaks. For each consensus peak, if an overlapping peak was called for a given sample, that peak's score (normalized to 0..1) is used as the value; otherwise it is set to -1. The columns of this matrix form the vectors that are correlated for the heatmap.
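A toy illustration of why the correlation is well defined despite the samples having different numbers of called peaks (the matrix values and sample count are made up, but follow the -1 / 0..1 scheme described above):

```python
import numpy as np

# Toy occupancy matrix: rows = consensus peaks, columns = samples.
# Scores are normalized to 0..1 where the sample had an overlapping peak,
# and -1 where it did not.
occupancy = np.array([
    [ 0.9,  0.8, -1.0],
    [ 0.5, -1.0, -1.0],
    [ 0.7,  0.6,  0.4],
    [-1.0,  0.3,  0.2],
])

# Every column has one entry per consensus peak, so all columns have the
# same length and the sample-vs-sample correlation matrix is well defined.
corr = np.corrcoef(occupancy, rowvar=False)
print(corr.shape)  # (3, 3)
```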
Generally speaking, the defaults in DiffBind are set on the conservative side, leaving the user to make specific decisions to apply looser thresholds.
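To see how the cutoff choice changes the number of reported sites, here is a sketch of the Benjamini-Hochberg step-up procedure (the p-values are made up; DiffBind itself works from the adjusted p-values returned by the underlying DESeq2/edgeR analysis):

```python
import numpy as np

def bh_reject(pvals, fdr):
    """Benjamini-Hochberg: number of hypotheses rejected at the given FDR."""
    p = np.sort(np.asarray(pvals))
    m = len(p)
    # Step-up thresholds: fdr * k / m for k = 1..m.
    thresholds = fdr * np.arange(1, m + 1) / m
    below = np.nonzero(p <= thresholds)[0]
    # Reject everything up to the largest k whose p-value clears its threshold.
    return 0 if below.size == 0 else int(below[-1]) + 1

# Hypothetical p-values: the stricter 0.05 cutoff rejects fewer sites
# than the looser 0.1 cutoff.
pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.2, 0.6]
print(bh_reject(pvals, 0.05))  # 2
print(bh_reject(pvals, 0.10))  # 4
```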