DiffBind : How is computed the correlation reported when using dba()
2
0
Entering edit mode
@pierre-francois-roux-7997
Last seen 3.4 years ago
France

Dear all,

I am working on different histone modification ChIP-seq datasets (K4Me1, K4Me3 and K27Ac) in the scope of a time course. Each of these ChIP-seq experiments have been ran twice. Basically, that means I have duplicates (and matched input) for each of these histone modification at each time point.

After global filtering on reads, I called peaks using MACS2 considering matched inputs. To perform the differential binding analysis, I finally ran the DiffBind package on those data.

When I created the DiffBind dba object with :

Expe_K4Me1 <- dba(sampleSheet=Samples_K4Me1,skipLines = 1,bRemoveM=TRUE,minOverlap = 1)

I had a surprisingly low correlation (< 0.2) between biological replicates. Here is an example for K4Me1, showing the correlation heatmap reported by DiffBind when creating the dba object (D0K4Me1_1 and D0K4Me1_2 are replicates for the time point D0, while D6K4Me1_1 and D6K4Me1_2 are replicates for the time point D6).

Before running DiffBind, I checked the overall concordance between replicates, looking at the distribution of reads in 1000 bp bins across the genome and get correlation > 0.9. Moreover, after peak calling, I checked the consistency between replicates by computing the overlap between peaks detected by MACS in each replicates for a given time point using BedTools. Once again, I get satisfying results (more than 80 % of peaks detected in both replicates).

I am wondering :

• What is called  “correlation” at this step of the analysis, since we are only looking at the overlap between peaks detected in each experiment ? Should I understand "correlation" = "overlapping rate" ?
• How DiffBind compute the correlation at this step of the analysis ?
• If "correlation" = "overalpping rate", why is the overlap reported by DiffBind so weak in this example, while it was pretty good using BedTools ?

Many thanks for you advices and help.

Cheers,

Pierre-François

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] org.Hs.eg.db_3.1.2      GenomicFeatures_1.20.0  AnnotationDbi_1.30.1    Biobase_2.28.0          ChIPpeakAnno_3.2.2
[6] biomaRt_2.24.0          VennDiagram_1.6.9       ChIPseeker_1.4.1        DiffBind_1.14.3         RSQLite_1.0.0
[11] DBI_0.3.1               locfit_1.5-9.1          GenomicAlignments_1.4.1 Rsamtools_1.20.2        Biostrings_2.36.0
[16] XVector_0.8.0           limma_3.24.3            GenomicRanges_1.20.3    GenomeInfoDb_1.4.0      IRanges_2.2.1
[21] S4Vectors_0.6.0         BiocGenerics_0.14.0

diffbind • 1.3k views
3
Entering edit mode
Rory Stark ★ 4.1k
@rory-stark-5741
Last seen 20 days ago
CRUK, Cambridge, UK

Hello Pierre-François-

The correlations are based scores generated by the peak caller. If you are using MACS2 _broad_peaks.bed files, which I believe use the -log10(pval) as the score, these will be normalized to a 0...1 scale by dividing the scores in each peakset by the maximum score. If a peak is not called for a sample, it will be assigned a score of -1. The correlations are by default the person correlation of each sample's peak scores with each other samples scores  (logged before the correlation is computed). You can see the correlation matrix as follows:

> corvals <- plot(Expe_K4Me1) > corvals

As it turns out, the noise in peak calling combined with the nature of p-value distributions don't make for a great way to do scoring, which is why the correlation heatmaps are more reliable after you have done the counting stage using dba.count(). This will form a count score for every peak interval for every sample, regardless of whether it was called as a peak. It would be worth a try to do the counting to see if the correlations are similar to those you calculated using bedtools.

You can check the overlaps as well. For example the following will generate Venn diagrams for the replicates:

> par(mfrow=c(2,1))
> dba.plotVenn(Expe_K4Me1,mask=Expe_K4Me1$masks$D0)
> dba.plotVenn(Expe_K4Me1,mask=Expe_K4Me1$masks$D6)

You can check overall overlapping rates by computing the number of sites that overlap one, two, three, or all four samples, or by plotting a 4-way Venn:

> dba.overlap(Expe_K4Me1,mode=DBA_OLAP_RATE)
> par(mfrow=c(1,1))
> dba.plotVenn(Expe_K4Me1,mask=1:4)

Cheers-
Rory

0
Entering edit mode
@pierre-francois-roux-7997
Last seen 3.4 years ago
France

Hi Rory,

Many thanks for this great explanation.

I will try to run dba.count on the whole peaks, before defining consensus peaksets for peaks shared by biological replicates, to see whether it improves the overall correlation.

Cheers !

Pierre-François

0
Entering edit mode

Runing dba.count gave result closer to what I was expecting.

> dba.count(Expe_K4Me1)

Even if the information is different from the overlapping rate of called peaks between each replicates, it is interesting to consider it.

Thanks again !

Pef