Hi,
I have ChIP-seq data from multiple 15, 2 reps each (so 30 total) and multiple conditions. After calling peaks with MACS2 using 2 reps to produce a set of peaks in a highly consistent manner, I upload the data with DiffBind (dba function), creating a consensus peak set of 127,994. Since our dataset is so large, in order to minimize the number of unnecessary independent tests in the Benjamini-Hochgber correction, I would like to subset the consensus peaks to those overlapping with a certain number of samples.
One of the samples to be in the subset has originally 69,554 peaks called by MACS2 on rep1 and rep2 of that samples. However, when I check the overlap between the consensus peak set and that sample to subset the peaks in the consensus that come for that sample, I found only 8,635 consensus peaks that overlap those from the 69,554 peaks, or alternatively (and this is what I do not understand) 8,668 of the sample that overlap any of the 127,994 consensus peaks.
Here I summarize generically the code
Creating DBA object with all data
DBA.object <- dba(sampleSheet = Human.total, minOverlap = 2, peakCaller = "narrow", bRemoveM=FALSE, # Do not remove mithocondrial peaks peakFormat = "bed")
Count reads over consensus peaks
dba.object.count <- dba.count(DBA.object, minOverlap = 2, score=DBASCOREREADS_MINUS, summits = TRUE, # peaks summits are computed WITHOUT re-centering filter = 1, bScaleControl=TRUE)
Extracting GRanges object with consensus peaks (called "universe of peaks"):
gr.universe<- dba.peakset(dba.object.count,dba.object.count$binding,bRetrieve = TRUE, DataType = DBADATAGRANGES,numCols=3)
length(gr.universe) = 127,994
Extracting GRanges object from a particular sample
peakfile <- "WT.VEHMCF7.replicated.Human.narrowPeak" peaks.WT.VEH <- import(peak_file,format = "narrowPeak")
length(peaks.WT.VEH) = 69,554
Before computing the overlap between consensus and sample peak sets, I translate
the peaks in the sample to those seq names in the consensus, since the latter numbers 1 - 57
while the former is chr1 - chr22, chrX, chrY,... etc...
seqnames(gr.universe) factor-Rle of length 127994 with 57 runs Lengths: 10027 5489 4152 7063 3474 5380 4647 3440 ... 1 5 2 1 1 1 2 Values : 1 2 3 4 5 6 7 8 ... 51 52 53 54 55 56 57 Levels(57): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
seqnames(peaks.WT.VEH) factor-Rle of length 69554 with 5467 runs Lengths: 3345 238 11 1 ... 2 1 1 1 Values : chr10 chr1 chr11 chr1 ... KI270467.1 KI270713.1 KI270719.1 KI270733.1 Levels(34): chr10 chr1 chr11 chr12 chr13 ... KI270442.1 KI270467.1 KI270713.1 KI270719.1 KI270733.1
seqlevels(peaks.WT.VEH) <- as.character(unk_chromosomes2[seqlevels(peaks.WT.VEH)])
seqnames(peaks.WT.VEH) factor-Rle of length 69554 with 5467 runs Lengths: 3345 238 11 1 3 1 5 1 2 ... 2 2 1 1 1 2 1 1 1 Values : 10 1 11 1 11 1 11 1 11 ... 29 31 34 36 37 39 46 47 52 Levels(34): 10 1 11 12 13 14 15 16 17 18 19 20 2 21 22 3 4 5 6 7 8 9 23 26 28 29 31 34 36 37 39 46 47 52
Overlap between consensus of peaks and sample, i.e. which peaks in the Universe of peaks have at
least 1 bp overlap with any peak in the sample
gr.intersect <- gr.universe[gr.universe %over% peaks.WT.VEH]
length(gr.intersect) = 8,635
Overlap between sample and consensus peak set (i.e. how many original sample peaks are overlapping at least 1 bp with the consensus #peaks
gr.intersect2 <- peaks.WT.VEH[peaks.WT.VEH %over% gr.universe]
length(gr.intersect2) = 8,668
While do not expect those length(gr.intersect) AND length(gr.intersect2) to be the same, since the consensus peak and the original peak width are different, so the overlaps are not symmetric. However where are the 69,554 - 8,668 = 60,886 remaining peaks ?
Thanks,
sessionInfo() R version 3.5.2 (2018-12-20) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)
Matrix products: default BLAS: /usr/local/lib64/R/lib/libRblas.so LAPACK: /usr/local/lib64/R/lib/libRlapack.so
locale:
[1] LCCTYPE=enUS.UTF-8 LCNUMERIC=C LCTIME=enUS.UTF-8
[4] LCCOLLATE=enUS.UTF-8 LCMONETARY=enUS.UTF-8 LCMESSAGES=enUS.UTF-8
[7] LCPAPER=enUS.UTF-8 LCNAME=C LCADDRESS=C
[10] LCTELEPHONE=C LCMEASUREMENT=enUS.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] grid parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] eulerr6.0.2 data.table1.12.8
[3] SparseSummarizedExperiment0.0.0.9021 remotes2.0.2
[5] readr1.3.1 DiffBind2.10.0
[7] amap0.8-16 hrbrthemes0.6.0
[9] viridis0.5.1 viridisLite0.3.0
[11] ashr2.2-40 apeglm1.4.2
[13] ggpubr0.2 plyr1.8.4
[15] regioneR1.14.0 BSgenome1.50.0
[17] gcapc1.6.0 sva3.30.1
[19] genefilter1.64.0 mgcv1.8-27
[21] nlme3.1-137 ggsci2.9
[23] NMF0.21.0 cluster2.0.7-1
[25] rngtools1.5 pkgmaker0.31
[27] registry0.5-1 M3C1.4.1
[29] magrittr1.5 ggplot23.1.0
[31] PoiClaClu1.0.2.1 RColorBrewer1.1-2
[33] pheatmap1.0.12 SuperExactTest1.0.7
[35] dplyr0.8.0.1 ChIPpeakAnno3.16.1
[37] VennDiagram1.6.20 futile.logger1.4.3
[39] DESeq21.22.2 GenomicAlignments1.18.1
[41] Rsamtools1.34.1 Biostrings2.50.2
[43] XVector0.22.0 SummarizedExperiment1.12.0
[45] DelayedArray0.8.0 BiocParallel1.16.6
[47] matrixStats0.55.0 GenomicFeatures1.34.8
[49] AnnotationDbi1.44.0 Biobase2.42.0
[51] rtracklayer1.42.2 GenomicRanges1.34.0
[53] GenomeInfoDb1.18.2 IRanges2.16.0
[55] S4Vectors0.20.1 BiocGenerics0.28.0
[57] BiocManager_1.30.10
loaded via a namespace (and not attached):
[1] tidyselect0.2.5 RSQLite2.1.1 htmlwidgets1.3 Rtsne0.15
[5] BatchJobs1.8 munsell0.5.0 codetools0.2-16 systemPipeR1.16.1
[9] withr2.1.2 colorspace1.4-0 Category2.48.1 knitr1.28
[13] rstudioapi0.9.0 pscl1.5.2 Rttf2pt11.3.8 sigclust1.1.0
[17] bbmle1.0.23.1 GenomeInfoDbData1.2.0 mixsqp0.3-17 hwriter1.3.2
[21] bit640.9-7 coda0.19-3 lambda.r1.2.4 xfun0.12
[25] R62.4.0 doParallel1.0.15 idr1.2 locfit1.5-9.1
[29] AnnotationFilter1.6.0 bitops1.0-6 assertthat0.2.0 scales1.0.0
[33] nnet7.3-12 gtable0.2.0 ensembldb2.6.8 rlang0.3.1
[37] BBmisc1.11 splines3.5.2 extrafontdb1.0 lazyeval0.2.1
[41] acepack1.4.1 brew1.0-6 checkmate2.0.0 yaml2.2.0
[45] reshape21.4.3 backports1.1.3 Hmisc4.2-0 extrafont0.17
[49] RBGL1.58.2 tools3.5.2 gridBase0.4-7 gplots3.0.1.1
[53] Rcpp1.0.0 base64enc0.1-3 progress1.2.0 zlibbioc1.28.0
[57] purrr0.3.1 RCurl1.98-1.1 prettyunits1.0.2 rpart4.1-13
[61] ggrepel0.8.1 futile.options1.0.1 truncnorm1.0-8 SQUAREM2017.10-1
[65] mvtnorm1.0-10 matrixcalc1.0-3 ProtGenerics1.14.0 evaluate0.13
[69] hms0.4.2 xtable1.8-3 XML3.98-1.18 emdbook1.3.11
[73] gridExtra2.3 bdsmatrix1.3-4 compiler3.5.2 biomaRt2.38.0
[77] tibble2.0.1 KernSmooth2.23-15 crayon1.3.4 htmltools0.3.6
[81] GOstats2.48.0 Formula1.2-3 snow0.4-3 geneplotter1.60.0
[85] sendmailR1.2-1 DBI1.0.0 formatR1.7 MASS7.3-51.1
[89] ShortRead1.40.0 Matrix1.2-15 ade41.7-13 gdata2.18.0
[93] pkgconfig2.0.2 numDeriv2016.8-1 foreign0.8-71 foreach1.4.4
[97] annotate1.60.1 multtest2.38.0 AnnotationForge1.24.0 bibtex0.4.2
[101] stringr1.4.0 digest0.6.18 graph1.60.0 rmarkdown2.1
[105] htmlTable1.13.3 dendextend1.13.2 edgeR3.24.3 gdtools0.1.7
[109] GSEABase1.44.0 curl3.3 gtools3.8.1 rjson0.2.20
[113] seqinr3.6-1 limma3.38.3 pillar1.3.1 lattice0.20-38
[117] httr1.4.0 survival2.43-3 GO.db3.7.0 glue1.3.0
[121] iterators1.0.10 bit1.1-14 Rgraphviz2.26.0 stringi1.3.1
[125] blob1.1.1 doSNOW1.0.16 latticeExtra0.6-28 caTools1.17.1.1
[129] memoise1.1.0 irlba2.3.3
>