5.0 years ago by
ucarduygu0 wrote:

Hi,

I have two questions about how DiffBind works (listed below). Could you help me figure these out? Thank you for your help in advance!

Best, Duygu Ucar.

1. First, why do the correlation plots generated from the same data (no contrast) look so different? For example, when I run the two commands below, the two correlation plots they generate are substantially different; the first one is in fact what we expect to see. I tried to understand what is causing this difference, but I could not figure it out from the function documentation. Could you help me with this?
> pbmc.subset <- dba(sampleSheet="ND5_diffbind_v2.csv")
> pbmc.subset <- dba.count(pbmc.subset, bParallel=TRUE)
2. In the case of no control samples, which normalization/summary method do you recommend? Are these significantly different? Do you have any guidelines/insights to follow for different types of datasets?

normalization diffbind
5.0 years ago by
Rory Stark
CRUK, Cambridge, UK
Rory Stark wrote:

Hello Duygu-

For the first question, the correlation heatmap in the two cases is based on potentially quite different data. In the first case, when you read in the sample sheet, you are correlating peak scores, while in the second case, you are correlating scores based directly on read counts. Peak scores are generally some statistic generated by the peak caller. As I don't know what your peak files look like, I don't know exactly what the scores are, but any peak identified somewhere in the experiment, yet not called for a specific sample, will have a missing score (set to -1), so the score vectors being correlated may be quite sparse. After invoking dba.count(), every sample will have a read count score for every peak, which can yield quite different clustering results.

In many cases, the clustering based on the peak scores and the clustering based on the read scores are very similar. When they differ, in my experience that usually points to an issue in peak calling, where peaks are being called inconsistently. One thing to check is the overall overlap rate: if this curve drops off too quickly (worse than geometric), there is little agreement between the peaksets called for each sample:

> pbmc.subset <- dba(sampleSheet="ND5_diffbind_v2.csv")
> rate <- dba.overlap(pbmc.subset, mode=DBA_OLAP_RATE)
> plot(rate,type='b',xlab="# peaksets",ylab="# common peaks")

For the second question, the control sample isn't as important for DiffBind as it is for peak calling (if you don't have control samples, this may be why the peak calling is inconsistent, as above). There are no hard and fast rules for choosing a scoring system. DBA_SCORE_RPKM is generally a good one to try, as this is just the read counts normalized to the library depth and the peak width. I usually use one of the DBA_SCORE_TMM_ scores, which do a more sophisticated normalization (see Robinson and Oshlack, Genome Biology 2010). In your case, with no control reads, you could use either DBA_SCORE_TMM_READS_FULL, which normalizes to the overall depth of sequencing in each library, or DBA_SCORE_TMM_READS_EFFECTIVE, which normalizes to the number of reads actually overlapping peaks. The latter score should only be used if you expect most of the peaks to have similar binding levels, with a small minority lost or gained.
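If it helps to see what DBA_SCORE_RPKM is computing, here is a minimal base-R sketch of the RPKM formula (reads per kilobase of peak width per million mapped reads); the counts, widths, and library size below are made-up numbers, not from any real sample:

```r
# Hypothetical read counts for three peaks in a single sample
counts   <- c(120, 300, 45)    # reads overlapping each peak
widths   <- c(400, 1500, 250)  # peak widths in bp
lib.size <- 20e6               # total mapped reads in the library

# RPKM: normalize to peak width (per kilobase) and library depth (per million)
rpkm <- counts / (widths / 1000) / (lib.size / 1e6)
rpkm  # 15 10 9
```

This is why RPKM makes scores comparable across peaks of different widths and libraries of different depths, whereas raw counts do not.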

Note that if you do a differential analysis after counting (via dba.analyze()), and then make plots using that contrast, the score will always be whatever normalization was used for the analysis, which depends on the underlying package (generally edgeR, which does the TMM normalization, or DESeq).

Hope this helps-

Rory

4.9 years ago by
ucarduygu0 wrote:

hi Rory,

I am still trying to understand why these two heatmaps are so different in our samples, and I have one quick follow-up question about your answer. For the confidence-score heatmap calculation, what value is assigned to a peak in samples where that peak has not been called? Do you assign a default score (such as p=0.05)? Maybe the missing peak scores are driving the correlation patterns. What do you think?

Thank you!

Duygu

Hello Duygu-

I'm not clear on exactly what the question is; perhaps you could supply a code snippet? In the original question there hadn't been any differential analysis done (no call to dba.analyze()), so I'm not sure what significance scores you are referring to. In the first instance, "missing" peaks for a sample get a score of -1. If you have very different peak calls, with only a few peaks called for some samples, then those samples would have a lot of -1 values and could indeed correlate closely with each other. If it turned out that the actual read counts in these regions were not that different from those of other samples, they could then be more highly correlated with (and hence cluster closer to) different samples.

-R