Dear all,
I am working on different histone modification ChIP-seq datasets (K4Me1, K4Me3 and K27Ac) in the scope of a time course. Each of these ChIP-seq experiments have been ran twice. Basically, that means I have duplicates (and matched input) for each of these histone modification at each time point.
After global filtering on reads, I called peaks using MACS2 considering matched inputs. To perform the differential binding analysis, I finally ran the DiffBind package on those data.
When I created the DiffBind dba object with :
Expe_K4Me1 <- dba(sampleSheet=Samples_K4Me1,skipLines = 1,bRemoveM=TRUE,minOverlap = 1)
I had a surprisingly low correlation (< 0.2) between biological replicates. Here is an example for K4Me1, showing the correlation heatmap reported by DiffBind when creating the dba object (D0K4Me1_1 and D0K4Me1_2 are replicates for the time point D0, while D6K4Me1_1 and D6K4Me1_2 are replicates for the time point D6).
Before running DiffBind, I checked the overall concordance between replicates, looking at the distribution of reads in 1000 bp bins across the genome and get correlation > 0.9. Moreover, after peak calling, I checked the consistency between replicates by computing the overlap between peaks detected by MACS in each replicates for a given time point using BedTools. Once again, I get satisfying results (more than 80 % of peaks detected in both replicates).
I am wondering :
- What is called “correlation” at this step of the analysis, since we are only looking at the overlap between peaks detected in each experiment ? Should I understand "correlation" = "overlapping rate" ?
- How DiffBind compute the correlation at this step of the analysis ?
- If "correlation" = "overalpping rate", why is the overlap reported by DiffBind so weak in this example, while it was pretty good using BedTools ?
Many thanks for you advices and help.
Cheers,
Pierre-François
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] org.Hs.eg.db_3.1.2 GenomicFeatures_1.20.0 AnnotationDbi_1.30.1 Biobase_2.28.0 ChIPpeakAnno_3.2.2
[6] biomaRt_2.24.0 VennDiagram_1.6.9 ChIPseeker_1.4.1 DiffBind_1.14.3 RSQLite_1.0.0
[11] DBI_0.3.1 locfit_1.5-9.1 GenomicAlignments_1.4.1 Rsamtools_1.20.2 Biostrings_2.36.0
[16] XVector_0.8.0 limma_3.24.3 GenomicRanges_1.20.3 GenomeInfoDb_1.4.0 IRanges_2.2.1
[21] S4Vectors_0.6.0 BiocGenerics_0.14.0
Runing dba.count gave result closer to what I was expecting.
> dba.count(Expe_K4Me1)
Even if the information is different from the overlapping rate of called peaks between each replicates, it is interesting to consider it.
Thanks again !
Pef