I am learning the DiffBind package, and I have some questions not covered by the vignette.
1) Is the peak score relevant to affinity analysis and, if so, in what way?
2) In which cases (generally) should one use an occupancy analysis rather than an affinity analysis?
3) The idea of greylists is to exclude regions that show a disproportionate amount of signal in the control. Shouldn't using the control sample at the peak-calling stage be sufficient? If there is a very strong signal in the control, the peak caller should not call a peak in that region.
4) What are the supplied control files used for (in affinity analysis)? I understand that they will be used if one performs background normalization; is there another use for them?
5) What are the assumptions made when setting bFullLibrarySize=TRUE?
As far as I understand, setting this parameter to FALSE assumes that the majority of peaks do not change between conditions. That can be assumed if the difference between conditions is known to change only a small number of peaks.
But does setting bFullLibrarySize=TRUE make specific assumptions of its own? Perhaps you could give an example of an experiment where those assumptions would be violated.
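To make the two assumptions concrete, here is a rough sketch in Python (all read counts here are made up for illustration; this is the general idea behind the two scaling approaches, not DiffBind's actual code):

```python
import numpy as np

# Hypothetical two-sample experiment: total sequenced reads per sample
# ("full library size") and reads falling in consensus peaks ("RiP").
full_lib = np.array([20_000_000, 40_000_000])  # total reads per sample
rip      = np.array([ 2_000_000,  8_000_000])  # reads in peaks per sample

# bFullLibrarySize=TRUE analogue: scale each sample by its total library size.
# Assumption: overall sequencing depth is the only systematic difference
# between samples.
full_factors = full_lib / full_lib.mean()  # roughly [0.67, 1.33]

# bFullLibrarySize=FALSE analogue: scale by reads in the binding matrix.
# Assumption: most peaks do NOT change, so total in-peak signal should be
# comparable across samples after normalization.
rip_factors = rip / rip.mean()             # [0.4, 1.6]

print(full_factors)
print(rip_factors)
```

If sample 2 genuinely has more overall binding (for example, a treatment that globally increases occupancy), scaling by reads-in-peaks would normalize that real biological difference away, while full-library-size scaling would preserve it.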
The bFullLibrarySize parameter itself has been removed; its functionality is now integrated into the new function dba.normalize(). The Normalization section of the vignette discusses the impact of different library sizes on normalization. In most cases, we recommend normalizing against a set of reference reads other than those in the binding matrix (reads overlapping consensus peaks), such as full library sizes, background bins, or spike-in reads.
Some more questions:
As you have explained here, using the entire region of enrichment includes more bases that are not truly enriched, leading to more noise. More noise should lead to fewer positive results in a good experimental system.
(In my specific case, with broad peaks, running DiffBind with summits=500 yields 330 peaks; running it with summits=FALSE yields 360 peaks, with 300 peaks shared between the two runs.)
In DiffBind v3, the default score parameter in dba.count() is DBA_SCORE_NORMALIZED. Does that mean the same normalization is used by default for dba.count() and dba.analyze()?
It is not necessarily the case that there will be fewer peaks when summits=FALSE. What is happening in your case is that you are specifying quite broad peaks (1001bp-wide intervals), so they are more likely to overlap each other and get merged into a single interval. When they get merged, they become even wider, so they can include even more "background" noise.
In general, we are not aiming to "get more precise boundaries of enrichment" in the sense of accurately delineating where enrichment starts and stops. Rather, we are aiming to settle on intervals with representative enrichment. In this case, we prefer narrower peaks taken from the middle of an enriched region to "represent" that region, rather than over-wide regions that include background noise. To obtain this, we re-center around summits and trim.
In general, when we think about the goals of our biological experiment, we rarely need the precise boundaries of enriched regions. Rather, we need confidence in the intervals identified as demonstrating differential enrichment. Typically these are mapped to biological features -- enhancers, promoters, etc. -- and then those features are used for downstream analysis.
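The merging effect described above can be sketched with a toy interval-merge in Python (the summit positions are hypothetical, and this is an illustration of the idea, not DiffBind's actual merging code):

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into maximal runs."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it (making it wider).
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical summit positions; re-centering to +/-500bp gives 1001bp peaks.
summits = [1000, 1800, 5000]
wide = [(s - 500, s + 500) for s in summits]

# The first two 1001bp intervals overlap, so three peaks merge into two,
# and the merged one spans 1800bp -- wider than either original interval.
print(merge_intervals(wide))
```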
The score used in dba.count() is only used for certain plotting and reporting functions, and does not impact the analysis. The analysis depends on how the data are normalized.
Early on, when edgeR and the original DESeq were developed, some work was done showing that the negative binomial is appropriate for modelling most sequencing count data, including specifically ChIP data (cf. McCarthy, Chen, and Smyth, 2012), but I'm not aware of subsequent systematic analysis in this area.
-- Rory Stark, Gord Brown
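To illustrate why a negative binomial (rather than, say, a Poisson) is used for such counts, here is a small simulation (the mean and dispersion values are made up; the mean-variance relation Var = mu + alpha*mu^2 is the standard NB2 form that edgeR and DESeq assume):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate read counts from a negative binomial with mean mu and
# dispersion alpha, so Var = mu + alpha * mu^2.
mu, alpha = 100.0, 0.2
n = 1 / alpha            # NB "size" parameter
p = n / (n + mu)         # success-probability parameterization

counts = rng.negative_binomial(n, p, size=100_000)

# A Poisson with the same mean would have variance ~100; the simulated
# counts are far more variable ("overdispersed"), as real replicate
# sequencing counts tend to be.
print(counts.mean())     # close to 100
print(counts.var())      # close to 100 + 0.2 * 100**2 = 2100
```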
On a different subject:
After reading in all the peaks with the dba() function, it is possible to create a correlation heatmap immediately (even before counting reads in consensus peaks). What are the per-sample values between which the correlation is calculated? The vignette says "Figure 1: Correlation heatmap, using occupancy (peak caller score) data". But the number of peaks, and hence the number of peak caller scores, differs between samples. AFAIK, to calculate a correlation you need vectors of the same length. Is it possible to view the matrix on which the correlation is calculated?
Generally speaking, why is the default FDR cutoff for the DESeq2 analysis in DiffBind 0.05, rather than DESeq2's "native" default of 0.1?
I have found an answer by Rory Stark to the first question here
The parameter should be between the minimum and Q1 (of the lengths of the consensus peaks).
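As a sketch of how one might compute that range for a given consensus peakset (the peak widths below are made up; in practice you would take the widths of your actual consensus peaks):

```python
import numpy as np

# Hypothetical consensus peak widths (bp) from a broad-peak experiment.
widths = np.array([400, 600, 800, 1000, 1500, 2000, 3000, 5000, 8000, 12000])

lo = widths.min()                # minimum peak width
q1 = np.percentile(widths, 25)   # first quartile of the widths

# Per the advice quoted above, the recommended range for the parameter
# lies between these two values.
print(lo, q1)  # 400 850.0
```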
The answer to the second question was found here
When the peaks are first read in (prior to counting), a default binding matrix is created by merging peaks. For each consensus peak, if an overlapping peak was called for a given sample, that peak's score (normalized to 0..1) is used as the value; otherwise it is set to -1. The columns of this matrix form the vectors that are correlated for the heatmap.
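A toy illustration of why the correlation is well defined despite the samples having different numbers of called peaks (the matrix values and sample count are made up, but follow the -1 / 0..1 scheme described above):

```python
import numpy as np

# Toy occupancy matrix: rows = consensus peaks, columns = samples.
# Scores are normalized to 0..1 where the sample had an overlapping peak,
# and -1 where it did not.
occupancy = np.array([
    [ 0.9,  0.8, -1.0],
    [ 0.5, -1.0, -1.0],
    [ 0.7,  0.6,  0.4],
    [-1.0,  0.3,  0.2],
])

# Every column has one entry per consensus peak, so all columns have the
# same length and the sample-vs-sample correlation matrix is well defined.
corr = np.corrcoef(occupancy, rowvar=False)
print(corr.shape)  # (3, 3)
```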
Generally speaking, the defaults in DiffBind are set on the conservative side, leaving the user to make specific decisions to apply looser thresholds.
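To see how the cutoff choice changes the number of reported sites, here is a sketch of the Benjamini-Hochberg step-up procedure (the p-values are made up; DiffBind itself works from the adjusted p-values returned by the underlying DESeq2/edgeR analysis):

```python
import numpy as np

def bh_reject(pvals, fdr):
    """Benjamini-Hochberg: number of hypotheses rejected at the given FDR."""
    p = np.sort(np.asarray(pvals))
    m = len(p)
    # Step-up thresholds: fdr * k / m for k = 1..m.
    thresholds = fdr * np.arange(1, m + 1) / m
    below = np.nonzero(p <= thresholds)[0]
    # Reject everything up to the largest k whose p-value clears its threshold.
    return 0 if below.size == 0 else int(below[-1]) + 1

# Hypothetical p-values: the stricter 0.05 cutoff rejects fewer sites
# than the looser 0.1 cutoff.
pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.2, 0.6]
print(bh_reject(pvals, 0.05))  # 2
print(bh_reject(pvals, 0.10))  # 4
```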