Question

DiffBind differential analysis

0

Entering edit mode

narzouni • 0

@28e4c26c

Last seen 2.8 years ago

United States

Hello I have a question about the consensus peakset used in DiffBind. I am using ChipSeq data to find statistically significant differential binding sites between two groups.

After running Diffbind, I don't find any significant differential binding sites. However, if I take the intersection of replicates in each group, and then subtract the peaks found in the intersection of both groups. I still find some peaks that are found in one group but not the other.

Those peaks should be statistically significant, but I am not sure I understand why DiffBind did not find any regions that are significant although some regions exist in one group but not the other after taking the intersection of replicates in each group and subtract them.

I was thinking maybe DiffBind forms a lenient consensus peak set which takes only the peaks that are found in both two groups only without considering the peaks that maybe found in one group but not the other. or It is possible that Diffbind is not counting those regions in the consensus peakset maybe because they are short regions???

I hope you have an explanation why I can still find peaks that are in one group but not the other, and diffbind does not report significance.

I am taking the intersection of peaks in all replicates within each group, so those peaks should be true positives since they appear in every replicate, but still diffbind does not report any significance.

Thank you so much.

DifferentialPeakCalling ChIPanalyser DiffBind chipseqDB • 1.8k views

ADD COMMENT • link updated 2.7 years ago by Rory Stark ★ 5.2k • written 2.8 years ago by narzouni • 0

score 0 · Answer 1 · 2023-02-24

There are a number of reasons why you may see differences between peaks called in difference sample groups (such as peaks unique to one group) and those identified as significantly differentially bound un a quantitative analysis.

On the issue of consensus peaks, I would consider a consensus peakset that includes more peaks, including more peaks that are only identified in a few samples (such as a subset of replicates), to be more lenient, representing a union of possible peaks. By default, DiffBind should result in such an expansive consensus peakset, relying on the statistical analysis to assign low confidence scores to poor peaks rather than eliminating them up front. A more "strict" consensus could be formed by taking an intersection, with the problem that only the peaks seen in all samples -- which may be changing the least -- will be retained.

You can check which peaks are present in the consensus peakset by retrieving them in a GRanges object using dba,.peakset() with bRetrieve=TRUE. Are there genomic intervals corresponding to the peaks you are seeing in each of your sample groups present? Alternatively, instead of using the default consensus peakset (all peaks that are called in at least two of the samples in the experiment), the DiffBind vignette contains a section showing how to more systematically create a consensus peakset along the lines you suggest, by first taking an intersection of the replicates in each sample group, then combining these (union) into a consensus.

So why aren't you seeing the differential peaks you expect? Often the reason is that the experiment down not have enough power to confidently identify differences. This occurs when there is a lot of variance between replicates without having enough replicates to capture their distribution.

You can check for these in a number of ways. After counting reads, look at the clustering (using dba,plotHeatmap()) and PCA (using dba.plotPCA()) to see what is driving the variance in your samples. After performing an analysis, you can retrieve all of the statistics using dba.report() and setting th=1 to see all the consensus peaks. This will show you the relative concentrations of reads in each sample group and the calculated fold change, along with the p-value and FDR confidence scores. You can also set bCounts=TRUE to see the normalized read scores for each peak in each of the samples to see if this corresponds with what you are seeing in your genome browser.

Another reason for not seeing the differential peaks you expect can be an inappropriate normalization, where biological variance of interest is normalized away. You can use the dba.plotMA() function to look at the distributions of reads in the raw data, and compare to how this changes under different normalization schemes.

The bottom line is that using overlaps of called peak is not really a reliable way to detect differences in enrichment patterns. We have shown both that peaks unique to one condition do not necessarily stand up to a robust quantitative analysis, and likewise peaks that are called in all samples across conditions can still have very low FDR scores indicating that their affinity is consistently different between samples in different sample groups.