This is a question on the "summit" function in DiffBind.
I would like to generate a set of consensus peaks with width of 201bp, centered around the peak summit. The codes I used were:
dba.object <- dba( sampleSheet = metadata, peakCaller="bed", scoreCol=5, bRemoveM=TRUE, bLowerScoreBetter=TRUE, config=data.frame(bUsePval=TRUE), bCorPlot=FALSE) count.data <- dba.count( dba.object, minOverlap=1, score=DBA_SCORE_READS, bCorPlot=FALSE, summits=100, bUseSummarizeOverlaps=F )
To retrieve the count matrix, along with the peak coordinates, I used:
count.matrix <- dba.peakset(count.data, bRetrieve=T, DataType=DBA_DATA_FRAME)
However, I realized that there were a very small set of peaks larger than 201bp in the final consensus peak set (approximately 5% of the consensus peak set). I am unable to figure out why this would be so and would appreciate any assistance on this.
My saved dba.object can be retrieved from here: https://www.dropbox.com/s/eh3qpanm0lgdn6f/dba_object_debug.rds?dl=0
The DiffBind version I used was 2.2.9
Thank you,
Mei San
sessionInfo() R version 3.3.2 (2016-10-31) Platform: x86_64-centos-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] DiffBind_2.2.9 SummarizedExperiment_1.4.0 [3] Biobase_2.34.0 GenomicRanges_1.26.2 [5] GenomeInfoDb_1.10.2 IRanges_2.8.0 [7] S4Vectors_0.14.4 BiocGenerics_0.22.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.11 locfit_1.5-9.1 lattice_0.20-34 [4] GO.db_3.4.0 Rsamtools_1.26.1 Biostrings_2.42.1 [7] gtools_3.5.0 assertthat_0.2.0 digest_0.6.12 [10] R6_2.2.2 plyr_1.8.4 BatchJobs_1.6 [13] backports_1.0.5 ShortRead_1.32.1 RSQLite_1.0.0 [16] ggplot2_2.2.1 gplots_3.0.1 zlibbioc_1.20.0 [19] rlang_0.1.1 GenomicFeatures_1.26.2 lazyeval_0.2.0 [22] annotate_1.52.1 gdata_2.18.0 Matrix_1.2-7.1 [25] checkmate_1.8.2 systemPipeR_1.8.1 GOstats_2.40.0 [28] splines_3.3.2 BiocParallel_1.8.1 stringr_1.2.0 [31] pheatmap_1.0.8 RCurl_1.95-4.8 biomaRt_2.30.0 [34] munsell_0.4.3 sendmailR_1.2-1 rtracklayer_1.34.1 [37] pkgconfig_2.0.1 base64enc_0.1-3 BBmisc_1.11 [40] fail_1.3 tibble_1.3.3 edgeR_3.16.5 [43] XML_3.98-1.5 AnnotationForge_1.16.0 dplyr_0.7.1 [46] GenomicAlignments_1.10.0 bitops_1.0-6 grid_3.3.2 [49] RBGL_1.50.0 xtable_1.8-2 GSEABase_1.36.0 [52] gtable_0.2.0 DBI_0.5-1 magrittr_1.5 [55] scales_0.4.1 graph_1.52.0 KernSmooth_2.23-15 [58] amap_0.8-14 stringi_1.1.5 XVector_0.14.0 [61] hwriter_1.3.2 genefilter_1.56.0 bindrcpp_0.2 [64] limma_3.30.8 latticeExtra_0.6-28 brew_1.0-6 [67] rjson_0.2.15 RColorBrewer_1.1-2 tools_3.3.2 [70] glue_1.1.1 Category_2.40.0 survival_2.39-5 [73] AnnotationDbi_1.36.0 colorspace_1.3-2 caTools_1.17.1 [76] bindr_0.1
Hi Rory,
Thank you for the clarification. I think the fix you are planning would fit in better with the goal of the "summit" function. I look forward to the new release.
Thanks,
Mei San
Sometime, when we have some spare time (haha!) we should consider using PeakSplitter or similar to split distinct but nearby peaks, rather than discarding all but one. It (probably) wouldn't be that hard to write a wrapper R package around PeakSplitter's C++ core... I'll see what Mali (the author) thinks of the idea.
Cheers, - Gord