Creating DBA object with all data

Question

DiffBind 2.10.0: Consensus peaks seems to miss 2/3 of original sample peaks

0

Entering edit mode

rmendez • 0

@rmendez-15746

Last seen 5.1 years ago

University of North Carolina at Chapel …

Hi,

I have ChIP-seq data from multiple 15, 2 reps each (so 30 total) and multiple conditions. After calling peaks with MACS2 using 2 reps to produce a set of peaks in a highly consistent manner, I upload the data with DiffBind (dba function), creating a consensus peak set of 127,994. Since our dataset is so large, in order to minimize the number of unnecessary independent tests in the Benjamini-Hochgber correction, I would like to subset the consensus peaks to those overlapping with a certain number of samples.

One of the samples to be in the subset has originally 69,554 peaks called by MACS2 on rep1 and rep2 of that samples. However, when I check the overlap between the consensus peak set and that sample to subset the peaks in the consensus that come for that sample, I found only 8,635 consensus peaks that overlap those from the 69,554 peaks, or alternatively (and this is what I do not understand) 8,668 of the sample that overlap any of the 127,994 consensus peaks.

Here I summarize generically the code

Creating DBA object with all data

DBA.object <- dba(sampleSheet = Human.total, minOverlap = 2, peakCaller = "narrow", bRemoveM=FALSE, # Do not remove mithocondrial peaks peakFormat = "bed")

Count reads over consensus peaks

dba.object.count <- dba.count(DBA.object, minOverlap = 2, score=DBASCOREREADS_MINUS, summits = TRUE, # peaks summits are computed WITHOUT re-centering filter = 1, bScaleControl=TRUE)

Extracting GRanges object with consensus peaks (called "universe of peaks"):

gr.universe<- dba.peakset(dba.object.count,dba.object.count$binding,bRetrieve = TRUE, DataType = DBADATAGRANGES,numCols=3)

length(gr.universe) = 127,994

Extracting GRanges object from a particular sample

peakfile <- "WT.VEHMCF7.replicated.Human.narrowPeak" peaks.WT.VEH <- import(peak_file,format = "narrowPeak")

length(peaks.WT.VEH) = 69,554

Before computing the overlap between consensus and sample peak sets, I translate

the peaks in the sample to those seq names in the consensus, since the latter numbers 1 - 57

while the former is chr1 - chr22, chrX, chrY,... etc...

seqnames(gr.universe) factor-Rle of length 127994 with 57 runs Lengths: 10027 5489 4152 7063 3474 5380 4647 3440 ... 1 5 2 1 1 1 2 Values : 1 2 3 4 5 6 7 8 ... 51 52 53 54 55 56 57 Levels(57): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57

seqnames(peaks.WT.VEH) factor-Rle of length 69554 with 5467 runs Lengths: 3345 238 11 1 ... 2 1 1 1 Values : chr10 chr1 chr11 chr1 ... KI270467.1 KI270713.1 KI270719.1 KI270733.1 Levels(34): chr10 chr1 chr11 chr12 chr13 ... KI270442.1 KI270467.1 KI270713.1 KI270719.1 KI270733.1

seqlevels(peaks.WT.VEH) <- as.character(unk_chromosomes2[seqlevels(peaks.WT.VEH)])

seqnames(peaks.WT.VEH) factor-Rle of length 69554 with 5467 runs Lengths: 3345 238 11 1 3 1 5 1 2 ... 2 2 1 1 1 2 1 1 1 Values : 10 1 11 1 11 1 11 1 11 ... 29 31 34 36 37 39 46 47 52 Levels(34): 10 1 11 12 13 14 15 16 17 18 19 20 2 21 22 3 4 5 6 7 8 9 23 26 28 29 31 34 36 37 39 46 47 52

Overlap between consensus of peaks and sample, i.e. which peaks in the Universe of peaks have at

least 1 bp overlap with any peak in the sample

gr.intersect <- gr.universe[gr.universe %over% peaks.WT.VEH]

length(gr.intersect) = 8,635

Overlap between sample and consensus peak set (i.e. how many original sample peaks are overlapping at least 1 bp with the consensus #peaks

gr.intersect2 <- peaks.WT.VEH[peaks.WT.VEH %over% gr.universe]

length(gr.intersect2) = 8,668

While do not expect those length(gr.intersect) AND length(gr.intersect2) to be the same, since the consensus peak and the original peak width are different, so the overlaps are not symmetric. However where are the 69,554 - 8,668 = 60,886 remaining peaks ?

Thanks,

sessionInfo() R version 3.5.2 (2018-12-20) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

Matrix products: default BLAS: /usr/local/lib64/R/lib/libRblas.so LAPACK: /usr/local/lib64/R/lib/libRlapack.so

locale: [1] LCCTYPE=enUS.UTF-8 LCNUMERIC=C LCTIME=enUS.UTF-8
[4] LCCOLLATE=enUS.UTF-8 LCMONETARY=enUS.UTF-8 LCMESSAGES=enUS.UTF-8
[7] LCPAPER=enUS.UTF-8 LCNAME=C LCADDRESS=C
[10] LCTELEPHONE=C LCMEASUREMENT=enUS.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] grid parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] eulerr6.0.2 data.table1.12.8
[3] SparseSummarizedExperiment0.0.0.9021 remotes2.0.2
[5] readr1.3.1 DiffBind2.10.0
[7] amap0.8-16 hrbrthemes0.6.0
[9] viridis0.5.1 viridisLite0.3.0
[11] ashr2.2-40 apeglm1.4.2
[13] ggpubr0.2 plyr1.8.4
[15] regioneR1.14.0 BSgenome1.50.0
[17] gcapc1.6.0 sva3.30.1
[19] genefilter1.64.0 mgcv1.8-27
[21] nlme3.1-137 ggsci2.9
[23] NMF0.21.0 cluster2.0.7-1
[25] rngtools1.5 pkgmaker0.31
[27] registry0.5-1 M3C1.4.1
[29] magrittr1.5 ggplot23.1.0
[31] PoiClaClu1.0.2.1 RColorBrewer1.1-2
[33] pheatmap1.0.12 SuperExactTest1.0.7
[35] dplyr0.8.0.1 ChIPpeakAnno3.16.1
[37] VennDiagram1.6.20 futile.logger1.4.3
[39] DESeq21.22.2 GenomicAlignments1.18.1
[41] Rsamtools1.34.1 Biostrings2.50.2
[43] XVector0.22.0 SummarizedExperiment1.12.0
[45] DelayedArray0.8.0 BiocParallel1.16.6
[47] matrixStats0.55.0 GenomicFeatures1.34.8
[49] AnnotationDbi1.44.0 Biobase2.42.0
[51] rtracklayer1.42.2 GenomicRanges1.34.0
[53] GenomeInfoDb1.18.2 IRanges2.16.0
[55] S4Vectors0.20.1 BiocGenerics0.28.0
[57] BiocManager_1.30.10

loaded via a [1] tidyselect0.2.5 [5] BatchJobs1.8 [9] withr2.1.2 [13] rstudioapi0.9.0 [17] bbmle1.0.23.1 [21] bit640.9-7 [25] R62.4.0 [29] AnnotationFilter1.6.0 [33] nnet7.3-12 [37] BBmisc1.11 [41] acepack1.4.1 [45] reshape21.4.3 [49] RBGL1.58.2 [53] Rcpp1.0.0 [57] purrr0.3.1 [61] ggrepel0.8.1 [65] mvtnorm1.0-10 [69] hms0.4.2 [73] gridExtra2.3 [77] tibble2.0.1 [81] GOstats2.48.0 [85] sendmailR1.2-1 [89] ShortRead1.40.0 [93] pkgconfig2.0.2 [97] annotate1.60.1 [101] stringr1.4.0 [105] htmlTable1.13.3 [109] GSEABase1.44.0 [113] seqinr3.6-1 [117] httr1.4.0 [121] iterators1.0.10 [125] blob1.1.1 [129] memoise1.1.0 > namespace (and not attached): RSQLite2.1.1 htmlwidgets1.3 Rtsne0.15
munsell0.5.0 codetools0.2-16 systemPipeR1.16.1
colorspace1.4-0 Category2.48.1 knitr1.28
pscl1.5.2 Rttf2pt11.3.8 sigclust1.1.0
GenomeInfoDbData1.2.0 mixsqp0.3-17 hwriter1.3.2
coda0.19-3 lambda.r1.2.4 xfun0.12
doParallel1.0.15 idr1.2 locfit1.5-9.1
bitops1.0-6 assertthat0.2.0 scales1.0.0
gtable0.2.0 ensembldb2.6.8 rlang0.3.1
splines3.5.2 extrafontdb1.0 lazyeval0.2.1
brew1.0-6 checkmate2.0.0 yaml2.2.0
backports1.1.3 Hmisc4.2-0 extrafont0.17
tools3.5.2 gridBase0.4-7 gplots3.0.1.1
base64enc0.1-3 progress1.2.0 zlibbioc1.28.0
RCurl1.98-1.1 prettyunits1.0.2 rpart4.1-13
futile.options1.0.1 truncnorm1.0-8 SQUAREM2017.10-1
matrixcalc1.0-3 ProtGenerics1.14.0 evaluate0.13
xtable1.8-3 XML3.98-1.18 emdbook1.3.11
bdsmatrix1.3-4 compiler3.5.2 biomaRt2.38.0
KernSmooth2.23-15 crayon1.3.4 htmltools0.3.6
Formula1.2-3 snow0.4-3 geneplotter1.60.0
DBI1.0.0 formatR1.7 MASS7.3-51.1
Matrix1.2-15 ade41.7-13 gdata2.18.0
numDeriv2016.8-1 foreign0.8-71 foreach1.4.4
multtest2.38.0 AnnotationForge1.24.0 bibtex0.4.2
digest0.6.18 graph1.60.0 rmarkdown2.1
dendextend1.13.2 edgeR3.24.3 gdtools0.1.7
curl3.3 gtools3.8.1 rjson0.2.20
limma3.38.3 pillar1.3.1 lattice0.20-38
survival2.43-3 GO.db3.7.0 glue1.3.0
bit1.1-14 Rgraphviz2.26.0 stringi1.3.1
doSNOW1.0.16 latticeExtra0.6-28 caTools1.17.1.1
irlba2.3.3

DiffBind Consensus peak • 466 views

ADD COMMENT • link updated 5.1 years ago by Rory Stark ★ 5.2k • written 5.1 years ago by rmendez • 0

score 0 · Answer 1 · 2020-03-15

The "universe" of 128k peaks includes those that overlap in at least 2 of the 30 peaksets. Any one of the 30 peaksets may have a low rate of overlap. So it may be the case that the one you chose with 69k peaks really does only have 8k peaks that overlap with at least one other peakset.

However I suspect the real problem is that you should NOT be specifying the internal variable dba.object.count$binding to dba.peakset() when you are retrieving the gr.universe. Leaving this out will retrieve the consensus peaks. The way you are doing it results in the retrieved GRanges object to be using internal codes for the seqnames. I'm not sure you are mapping the chromosome names correctly. If there are any errors in assigning chromosome names, you will get a lower rate of overlap. If you leave out the reference to dba.object.count$binding, you'll get the correct chromosome names and won't have to re-map.