Questions about DiffBind Package
ucarduygu • 0
Last seen 6.8 years ago


I have two questions about how DiffBind works (listed below). Could you help me figure these out? Thank you for your help in advance!

Best, Duygu Ucar

  1. First, why do the correlation plots generated from the same data (no contrast) look very different? For example, when I run the two commands below, the two correlation plots they generate are substantially different; the first one is in fact what we expect to see. I tried to understand what is causing this difference but could not figure it out from the function descriptions. Could you help me with this?
    • pbmc.subset <- dba(sampleSheet="ND5_diffbind_v2.csv")
    • pbmc.subset <- dba.count(pbmc.subset,bParallel=TRUE)
  2. In the case of no control samples, which normalization/summary method do you recommend? Are they significantly different? Do you have any guidelines/insights to follow for different types of datasets? These are the score options listed in the dba.count documentation:


    • DBA_SCORE_READS: raw read count for interval using only reads from ChIP
    • DBA_SCORE_READS_FOLD: raw read count for interval from ChIP divided by read count for interval from control
    • DBA_SCORE_READS_MINUS: raw read count for interval from ChIP minus read count for interval from control
    • DBA_SCORE_RPKM: RPKM for interval using only reads from ChIP
    • DBA_SCORE_RPKM_FOLD: RPKM for interval from ChIP divided by RPKM for interval from control
    • DBA_SCORE_TMM_READS_FULL: TMM normalized (using edgeR), using ChIP read counts and Full Library size
    • DBA_SCORE_TMM_READS_EFFECTIVE: TMM normalized (using edgeR), using ChIP read counts and Effective Library size
    • DBA_SCORE_TMM_MINUS_FULL: TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Full Library size
    • DBA_SCORE_TMM_MINUS_EFFECTIVE: TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Effective Library size
    • DBA_SCORE_TMM_READS_FULL_CPM: same as DBA_SCORE_TMM_READS_FULL, but reported in counts-per-million
    • DBA_SCORE_TMM_READS_EFFECTIVE_CPM: same as DBA_SCORE_TMM_READS_EFFECTIVE, but reported in counts-per-million
    • DBA_SCORE_TMM_MINUS_FULL_CPM: same as DBA_SCORE_TMM_MINUS_FULL, but reported in counts-per-million
    • DBA_SCORE_TMM_MINUS_EFFECTIVE_CPM: same as DBA_SCORE_TMM_MINUS_EFFECTIVE, but reported in counts-per-million
    • DBA_SCORE_SUMMIT: summit height (maximum read pileup value)
    • DBA_SCORE_SUMMIT_ADJ: summit height (maximum read pileup value), normalized to relative library size
    • DBA_SCORE_SUMMIT_POS: summit position (location of maximum read pileup)

Rory Stark ★ 4.1k
Last seen 14 days ago
CRUK, Cambridge, UK

Hello Duygu-

For the first question, the correlation heatmaps in the two cases are based on potentially quite different data. In the first case, when you read in the sample sheet, you are correlating peak scores, while in the second case you are correlating scores based directly on read counts. Peak scores are generally some statistic generated by the peak caller. As I don't know what your peak files look like, I don't know exactly what the scores are, but any peak identified somewhere in the experiment, yet not called for a specific sample, will have a missing score (set to -1), so the score vectors being correlated may be quite sparse. After invoking dba.count(), every sample will have a read count score for every peak, which can yield quite different clustering results.

In many cases, the clusterings based on peak scores vs. read scores are very similar. When they differ, in my experience that usually points to an issue with peak calling, where peaks are being called inconsistently. One thing to check is the overall overlap rate; if this curve drops off too quickly (worse than geometric), it indicates that there is little agreement between the peaksets called for each sample:

> pbmc.subset <- dba(sampleSheet="ND5_diffbind_v2.csv")
> rate <-  dba.overlap(pbmc.subset, mode=DBA_OLAP_RATE)
> plot(rate,type='b',xlab="# peaksets",ylab="# common peaks")

For the second question, the control sample isn't as important for DiffBind as it will be for peak calling (if you don't have control samples, this may be why the peak calling is inconsistent above). There are no hard and fast rules for choosing a scoring system. DBA_SCORE_RPKM is generally a good one to try, as this is just the read counts normalized to the library depth and the peak width. I usually use one of the DBA_SCORE_TMM_ scores, which do a more sophisticated normalization (see Robinson and Oshlack, Genome Biology 2010).  In your case, with no control reads, you could use either DBA_SCORE_TMM_READS_FULL, which will normalize to the overall depth of sequencing in each library, or DBA_SCORE_TMM_READS_EFFECTIVE, which normalizes to the number of reads actually overlapping peaks. The latter score should only be used if you expect most of the peaks to have similar binding levels, with a small minority lost or gained.
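A minimal sketch of the two no-control options described above (object names follow the question; which score is appropriate depends on your data):

```r
# Normalize to the full sequencing depth of each library
pbmc.full <- dba.count(pbmc.subset, peaks=NULL,
                       score=DBA_SCORE_TMM_READS_FULL)

# Normalize to reads overlapping peaks ("effective" library size);
# only appropriate if most peaks are expected to have similar binding
pbmc.eff <- dba.count(pbmc.subset, peaks=NULL,
                      score=DBA_SCORE_TMM_READS_EFFECTIVE)

# Compare the resulting correlation heatmaps
plot(pbmc.full)
plot(pbmc.eff)
```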

Note that if you do a differential analysis after counting (via dba.analyze()), and then make plots using that contrast, the score will always be whatever normalization was used for the analysis, which depends on the underlying package (generally edgeR, which does the TMM normalization, or DESeq).
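For example, a contrast-based analysis might look like the sketch below (the use of the Condition metadata category is an assumption for illustration):

```r
# Set up a contrast, e.g. by the Condition metadata, then analyze
pbmc.subset <- dba.contrast(pbmc.subset, categories=DBA_CONDITION)
pbmc.subset <- dba.analyze(pbmc.subset)

# Plots made with the contrast use the analysis normalization,
# regardless of the score set in dba.count()
dba.plotHeatmap(pbmc.subset, contrast=1, correlations=FALSE)
dba.plotMA(pbmc.subset, contrast=1)
```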

Hope this helps-


ucarduygu • 0
Last seen 6.8 years ago

Hi Rory,

Thank you for your reply! This is really useful. 

I am still trying to understand why these two heatmaps are so different for our samples, and I have one quick follow-up question regarding your answer. When computing the confidence-score heatmap, what value is assigned to a peak in the samples where it has not been called? Do you assign a default score (such as p=0.05)? Maybe the missing peak scores are driving the correlation patterns. What do you think?

Thank you!



Hello Duygu-

I'm not clear on exactly what the question is; perhaps you could supply a code snippet? In the original question there hadn't been any differential analysis done (no call to dba.analyze()), so I'm not sure what significance scores you are referring to. In the first instance, "missing" peaks for a sample get a score of -1. If you have very different peak calls, with only a few peaks called for some samples, then those samples would have a lot of -1 values and could indeed correlate closely with each other. If it turned out that the actual read counts in these regions were not that different from other samples, they could then be more highly correlated with (and hence cluster closer to) different samples.
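One way to check how sparse the peak-score matrix is, before any counting, is to retrieve the consensus peakset and look at the -1 entries directly. A sketch using the object from the question (the column offset assumes chrom/start/end come first):

```r
library(DiffBind)
pbmc.subset <- dba(sampleSheet="ND5_diffbind_v2.csv")

# Retrieve the consensus peakset with one score column per sample
peaks <- dba.peakset(pbmc.subset, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)

# Fraction of "missing" (uncalled, score -1) peaks per sample
score.cols <- 4:ncol(peaks)   # columns after chrom/start/end
colMeans(peaks[, score.cols] == -1)
```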



