Questions about DiffBind Package
ucarduygu

Hi, 

I have two questions about how DiffBind works (listed below). Could you help me figure these out? Thank you in advance for your help!

Best, Duygu Ucar. 

  1. First, why do the correlation plots generated from the same data (with no contrast) look so different? When I run the two commands below, the two correlation plots they generate are substantially different, and the first one is in fact what we expect to see. I tried to understand what is causing this difference but could not figure it out from the function descriptions. Could you help me with this? (A sketch showing how I generate the plots follows the list of scoring options below.)
    • pbmc.subset <- dba(sampleSheet="ND5_diffbind_v2.csv")
    • pbmc.subset <- dba.count(pbmc.subset,bParallel=TRUE)
  2. In the case of no control samples, which normalization/summary method do you recommend? Are these significantly different? Do you have any guidelines or insights to follow for different types of datasets? The scoring options I see are listed below.

DBA_SCORE_READS: raw read count for interval using only reads from ChIP
DBA_SCORE_READS_FOLD: raw read count for interval from ChIP divided by read count for interval from control
DBA_SCORE_READS_MINUS: raw read count for interval from ChIP minus read count for interval from control
DBA_SCORE_RPKM: RPKM for interval using only reads from ChIP
DBA_SCORE_RPKM_FOLD: RPKM for interval from ChIP divided by RPKM for interval from control
DBA_SCORE_TMM_READS_FULL: TMM normalized (using edgeR), using ChIP read counts and full library size
DBA_SCORE_TMM_READS_EFFECTIVE: TMM normalized (using edgeR), using ChIP read counts and effective library size
DBA_SCORE_TMM_MINUS_FULL: TMM normalized (using edgeR), using ChIP read counts minus control read counts and full library size
DBA_SCORE_TMM_MINUS_EFFECTIVE: TMM normalized (using edgeR), using ChIP read counts minus control read counts and effective library size
DBA_SCORE_TMM_READS_FULL_CPM: same as DBA_SCORE_TMM_READS_FULL, but reported in counts per million
DBA_SCORE_TMM_READS_EFFECTIVE_CPM: same as DBA_SCORE_TMM_READS_EFFECTIVE, but reported in counts per million
DBA_SCORE_TMM_MINUS_FULL_CPM: same as DBA_SCORE_TMM_MINUS_FULL, but reported in counts per million
DBA_SCORE_TMM_MINUS_EFFECTIVE_CPM: same as DBA_SCORE_TMM_MINUS_EFFECTIVE, but reported in counts per million
DBA_SCORE_SUMMIT: summit height (maximum read pileup value)
DBA_SCORE_SUMMIT_ADJ: summit height (maximum read pileup value), normalized to relative library size
DBA_SCORE_SUMMIT_POS: summit position (location of maximum read pileup)
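
For reference, here is a minimal sketch of how I generate the two plots in question 1; calling plot() on a DBA object draws a correlation heatmap of its current scores:

> pbmc.subset <- dba(sampleSheet="ND5_diffbind_v2.csv")
> plot(pbmc.subset)                                      # first plot: peak-caller scores
> pbmc.subset <- dba.count(pbmc.subset, bParallel=TRUE)
> plot(pbmc.subset)                                      # second plot: read-count scores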

 

Rory Stark

Hello Duygu-

For the first question, the correlation heatmaps in the two cases are based on potentially quite different data. In the first case, when you read in the sample sheet, you are correlating peak scores, while in the second case you are correlating scores based directly on read counts. Peak scores are generally some statistic generated by the peak caller; as I don't know what your peak files look like, I don't know exactly what the scores are, but any peak identified somewhere in the experiment yet not called for a specific sample will have a missing score (set to -1), so the score vectors being correlated may be quite sparse. After invoking dba.count(), every sample has a read-count score for every peak, which can yield quite different clustering results.
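
If it helps to check this directly, here is a sketch (assuming dba.peakset() with bRetrieve=TRUE returns the merged peakset with one score column per sample, as a data frame when DataType=DBA_DATA_FRAME) for counting the -1 entries before dba.count() is run:

> peaks <- dba.peakset(pbmc.subset, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)
> scores <- peaks[, -(1:3)]    # drop the CHR/START/END columns, keep per-sample scores
> colSums(scores == -1)        # number of "missing" (uncalled) peaks per sample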

In many cases, the clustering based on peak scores and the clustering based on read scores are very similar. When they differ, in my experience that usually points to an issue in peak calling, where peaks are being called inconsistently. One thing to check is the overall overlap rate; if this curve drops off too quickly (worse than geometric), there is little agreement between the peaksets called for each sample:

> pbmc.subset <- dba(sampleSheet="ND5_diffbind_v2.csv")
> rate <- dba.overlap(pbmc.subset, mode=DBA_OLAP_RATE)
> plot(rate, type='b', xlab="# peaksets", ylab="# common peaks")

For the second question, the control sample isn't as important for DiffBind as it is for peak calling (if you don't have control samples, this may be why the peak calling is inconsistent, as above). There are no hard and fast rules for choosing a scoring system. DBA_SCORE_RPKM is generally a good one to try, as it is just the read counts normalized to library depth and peak width. I usually use one of the DBA_SCORE_TMM_ scores, which perform a more sophisticated normalization (see Robinson and Oshlack, Genome Biology 2010). In your case, with no control reads, you could use either DBA_SCORE_TMM_READS_FULL, which normalizes to the overall sequencing depth of each library, or DBA_SCORE_TMM_READS_EFFECTIVE, which normalizes to the number of reads actually overlapping peaks. The latter should only be used if you expect most of the peaks to have similar binding levels, with a small minority lost or gained.
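
As a practical note, once counting is done you can switch between these scores without re-counting by passing peaks=NULL to dba.count(); a minimal sketch:

> # re-score the existing counts, TMM-normalized against full library size
> pbmc.subset <- dba.count(pbmc.subset, peaks=NULL, score=DBA_SCORE_TMM_READS_FULL)
> plot(pbmc.subset)    # the correlation heatmap now reflects the new score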

Note that if you do a differential analysis after counting (via dba.analyze()) and then make plots using that contrast, the score will always be whatever normalization was used for the analysis, which depends on the underlying package (generally edgeR, which uses TMM normalization, or DESeq).
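
For example, a sketch assuming your sample sheet has a Condition column with enough replicates in each group to form a contrast:

> pbmc.subset <- dba.contrast(pbmc.subset, categories=DBA_CONDITION)
> pbmc.subset <- dba.analyze(pbmc.subset)     # normalization is now determined by the analysis method
> dba.plotHeatmap(pbmc.subset, contrast=1)    # correlations over the differentially bound sites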

Hope this helps-

Rory

ucarduygu

Hi Rory,

Thank you for your reply! This is really useful. 

I am still trying to understand why these two heatmaps are so different for our samples, and I have one quick follow-up question about your answer. When computing the confidence-score heatmap, what value is assigned to peaks in samples where the peak has not been called? Do you assign a default score (such as p=0.05)? Maybe the missing peak scores are driving the correlation patterns. What do you think?

Thank you!

Duygu

Rory Stark

Hello Duygu-

I'm not clear on exactly what the question is; perhaps you could supply a code snippet? In the original question no differential analysis had been done (no call to dba.analyze()), so I'm not sure which significance scores you are referring to. In the first instance, "missing" peaks for a sample get a score of -1. If you have very different peak calls, with only a few peaks called for some samples, then those samples would have many -1 values and could indeed correlate closely with each other. If it turned out that the actual read counts in these regions were not that different from those in other samples, the samples could then be more highly correlated with (and hence cluster closer to) different samples.
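
One quick thing to check is how many peaks were called for each sample; dba.show() reports this in its Intervals column:

> dba.show(pbmc.subset)    # compare the Intervals column across samples

If some samples have far fewer intervals than the rest, that would be consistent with the -1 effect described above.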

 

-R
