Enter the body of text here
1: How to extract the counts of each replicate after dba.count while doing the occupancy analysis? I would like develop a genomic ranges object that consist of consensus peaks (chr, start, end,) and in addition to score two column showing the normalized counts of each replicate in only these consensus peak set.
2: Also, why counts(48923) are different than peak set (52772)?
Below is my code. Thanks a lot for the help!
# Data consist of only one condition with two replicates. First, made a consensus peak set.
Tet1_consensus <- dba.peakset(Tet1.BL, consensus= c(DBA_TISSUE, DBA_FACTOR), minOverlap=2)
Tet1_consensus <- dba(Tet1_consensus, mask=Tet1_consensus$masks$Consensus)
consensusPeaks <- dba.peakset(Tet1_consensus, bRetrieve = TRUE,bRemoveM = TRUE, bRemoveRandom= TRUE)
Tet1.count <- dba.count(Tet1.BL, peaks=Tet1consensusPeaks, summits=200,bUseSummarizeOverlaps=TRUE, score= "DBA_SCORE_NORMALIZED",bParallel= Tet1.BL$config$RunParallel)
Tet1.count
# RESULTS
> Tet1_consensus
3 Samples, 52772 sites in matrix (182790 total):
ID Tissue Factor Condition Replicate Intervals
1 Tet1_wt_24h_rep1 wt wt_24h 24h 1 82980
2 Tet1_wt_24h_rep2 wt wt_24h 24h 2 157641
3 ALL wt wt_24h 24h 1-2 52772
> consensusPeaks
GRanges object with 52772 ranges and 1 metadata column:
seqnames ranges strand | ALL
<Rle> <IRanges> <Rle> | <numeric>
1 chr1 3264111-3264373 * | 0.00559169
2 chr1 3344674-3344984 * | 0.00960388
3 chr1 3531530-3532044 * | 0.00877813
4 chr1 3532986-3533302 * | 0.00621673
5 chr1 3556829-3557065 * | 0.00428748
... ... ... ... . ...
52768 chrY 5364048-5364360 * | 0.00425416
52769 chrY 9769327-9769664 * | 0.00823932
52770 chrY 9820370-9820976 * | 0.01097052
52771 chrY 10165336-10166194 * | 0.00799716
52772 chrY_JH584303_random 31923-32397 * | 0.00740656
Tet1.count
2 Samples, 48923 sites in matrix:
ID Tissue Factor Condition Replicate Reads FRiP
1 Tet1_wt_24h_rep1 wt wt_24h 24h 1 32518176 0.09
2 Tet1_wt_24h_rep2 wt wt_24h 24h 2 25186571 0.10
sessionInfo( )
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
locale:
[1] en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] LSD_4.1-0 DiffBind_3.2.5 SummarizedExperiment_1.22.0
[4] Biobase_2.52.0 MatrixGenerics_1.4.0 matrixStats_0.58.0
[7] GenomicRanges_1.44.0 GenomeInfoDb_1.28.0 IRanges_2.26.0
[10] S4Vectors_0.30.0 BiocGenerics_0.38.0
loaded via a namespace (and not attached):
[1] GOstats_2.58.0 backports_1.2.1 BiocFileCache_2.0.0 plyr_1.8.6
[5] GSEABase_1.54.0 splines_4.1.0 BiocParallel_1.26.0 ggplot2_3.3.5
[9] amap_0.8-18 digest_0.6.27 invgamma_1.1 htmltools_0.5.1.1
[13] GO.db_3.13.0 SQUAREM_2021.1 fansi_0.5.0 magrittr_2.0.1
[17] checkmate_2.0.0 memoise_2.0.0 BSgenome_1.60.0 base64url_1.4
[21] limma_3.48.0 Biostrings_2.60.0 annotate_1.70.0 systemPipeR_1.26.3
[25] bdsmatrix_1.3-4 prettyunits_1.1.1 jpeg_0.1-8.1 colorspace_2.0-1
[29] blob_1.2.1 rappdirs_0.3.3 apeglm_1.14.0 ggrepel_0.9.1
[33] xfun_0.23 dplyr_1.0.7 crayon_1.4.1 RCurl_1.98-1.3
[37] jsonlite_1.7.2 graph_1.70.0 genefilter_1.74.0 brew_1.0-6
[41] survival_3.2-11 VariantAnnotation_1.38.0 glue_1.4.2 gtable_0.3.0
[45] zlibbioc_1.38.0 XVector_0.32.0 DelayedArray_0.18.0 V8_3.4.2
[49] Rgraphviz_2.36.0 scales_1.1.1 pheatmap_1.0.12 mvtnorm_1.1-1
[53] DBI_1.1.1 edgeR_3.34.0 Rcpp_1.0.6 xtable_1.8-4
[57] progress_1.2.2 emdbook_1.3.12 bit_4.0.4 rsvg_2.1.2
[61] truncnorm_1.0-8 AnnotationForge_1.34.0 httr_1.4.2 gplots_3.1.1
[65] RColorBrewer_1.1-2 ellipsis_0.3.2 pkgconfig_2.0.3 XML_3.99-0.6
[69] sass_0.4.0 dbplyr_2.1.1 locfit_1.5-9.4 utf8_1.2.1
[73] tidyselect_1.1.1 rlang_0.4.11 AnnotationDbi_1.54.0 munsell_0.5.0
[77] tools_4.1.0 cachem_1.0.5 generics_0.1.0 RSQLite_2.2.7
[81] evaluate_0.14 stringr_1.4.0 fastmap_1.1.0 yaml_2.2.1
[85] knitr_1.33 bit64_4.0.5 caTools_1.18.2 purrr_0.3.4
[89] KEGGREST_1.32.0 RBGL_1.68.0 biomaRt_2.48.0 compiler_4.1.0
[93] rstudioapi_0.13 filelock_1.0.2 curl_4.3.1 png_0.1-7
[97] geneplotter_1.70.0 tibble_3.1.2 bslib_0.2.5.1 stringi_1.6.2
[101] GenomicFeatures_1.44.0 lattice_0.20-44 Matrix_1.3-3 vctrs_0.3.8
[105] pillar_1.6.1 lifecycle_1.0.0 BiocManager_1.30.15 jquerylib_0.1.4
[109] irlba_2.3.3 data.table_1.14.0 bitops_1.0-7 rtracklayer_1.52.0
[113] R6_2.5.0 BiocIO_1.2.0 latticeExtra_0.6-29 hwriter_1.3.2
[117] ShortRead_1.50.0 KernSmooth_2.23-20 MASS_7.3-54 gtools_3.8.2
[121] assertthat_0.2.1 DESeq2_1.32.0 Category_2.58.0 rjson_0.2.20
[125] withr_2.4.2 GenomicAlignments_1.28.0 batchtools_0.9.15 Rsamtools_2.8.0
[129] GenomeInfoDbData_1.2.6 hms_1.1.0 grid_4.1.0 DOT_0.1
[133] coda_0.19-4 rmarkdown_2.8 GreyListChIP_1.24.0 ashr_2.2-47
[137] mixsqp_0.3-43 bbmle_1.0.23.1 numDeriv_2016.8-1.1 restfulr_0.0.13
Dear Rory, thanks for the response. Following your suggestion, i can now get the count of both replicates.
However, the length issue stays; Tet1consensusCount is still not the same as consensusPeaks even if i change the summit=50, or summit= FALSE.
Tet1.count <- dba.count(Tet1.BL, peaks=Tet1consensusPeaks, summits=FALSE,score=DBA_SCORE_NORMALIZED, bUseSummarizeOverlaps=TRUE) Tet1.count consensusCount <- dba.peakset(Tet1.count, bRetrieve=TRUE) GRanges object with 48229 ranges and 2 metadata columns: seqnames ranges strand | Tet1_wt_24h_rep1 Tet1_wt_24h_rep2 <Rle> <IRanges> <Rle> | <numeric> <numeric> 1 chr1 3264111-3264373 | 9.75996 22.9109 2 chr1 3531530-3532044 | 21.29446 53.8407 3 chr1 3532986-3533302 | 14.19631 18.3287 4 chr1 3556829-3557065 | 9.75996 12.6010 5 chr1 3593397-3593597 * | 3.54908 12.6010
Question now is, 1: Since its an occupancy analysis, I need these GRnages to overlap with other TF ChIPseq data sets. Should I use GRnages of "consensusPeaks" or GRnages of consensusCount. The later gives me read count of each replicate which is meaningful in some downstream analysis. 2: I may have totally misunderstood but I am confused whether one should use summit or not at all? what is the benefit beside it gives the GRanges of the same length?
3: why setting summit= FALSE leads to a GRanges object with length of 48229, should not it be the same length as "consensusPeaks" of length 52772?
Thank you in advance for help/suggestions.
I'm still not clear on where exactly
Tet1consensusPeaks
comes from in your original code. Is it the same asconsensusPeaks
?This may have something to do with the
_random
contigs. I'm pretty sure that the 'bRemoveRandom' parameter is ignored when retrieving the consensus peakset, so it could be that the difference between 52772 and 48923 sites are the ones on chrM and/or _random contigs. If you sent me access to yourTet1.BL
object (and possibleTet1consensusPeaks
if it was derived from something other thanTet1.BL
), I could have a look to see what sites are being merged/removed.Regarding
summits
, besides making the peak widths more uniform, centering on the point of maximal consensus enrichment increases the promotion of the bases included in the analysis that represent enriched areas. Including excess background signal, as is more predominant at the outskirts of the intervals, lessens the accuracy of both the normalization and analysis steps.