Question

DiffBind Occupancy analysis

0

Entering edit mode

faiza • 0

@d12c1880

Last seen 3.3 years ago

Switzerland

Enter the body of text here

1: How to extract the counts of each replicate after dba.count while doing the occupancy analysis? I would like develop a genomic ranges object that consist of consensus peaks (chr, start, end,) and in addition to score two column showing the normalized counts of each replicate in only these consensus peak set.
2: Also, why counts(48923) are different than peak set (52772)?

Below is my code. Thanks a lot for the help!

# Data consist of only one condition with two replicates. First, made a consensus peak set. 
Tet1_consensus <- dba.peakset(Tet1.BL, consensus= c(DBA_TISSUE, DBA_FACTOR), minOverlap=2)
Tet1_consensus <- dba(Tet1_consensus, mask=Tet1_consensus$masks$Consensus)
consensusPeaks <- dba.peakset(Tet1_consensus, bRetrieve = TRUE,bRemoveM = TRUE, bRemoveRandom= TRUE)
Tet1.count <- dba.count(Tet1.BL, peaks=Tet1consensusPeaks, summits=200,bUseSummarizeOverlaps=TRUE, score= "DBA_SCORE_NORMALIZED",bParallel= Tet1.BL$config$RunParallel)
Tet1.count

# RESULTS
> Tet1_consensus
3 Samples, 52772 sites in matrix (182790 total):
                ID Tissue Factor Condition Replicate Intervals
1 Tet1_wt_24h_rep1     wt wt_24h       24h         1     82980
2 Tet1_wt_24h_rep2     wt wt_24h       24h         2    157641
3              ALL     wt wt_24h       24h       1-2     52772

> consensusPeaks
GRanges object with 52772 ranges and 1 metadata column:
                    seqnames            ranges strand |        ALL
                       <Rle>         <IRanges>  <Rle> |  <numeric>
      1                 chr1   3264111-3264373      * | 0.00559169
      2                 chr1   3344674-3344984      * | 0.00960388
      3                 chr1   3531530-3532044      * | 0.00877813
      4                 chr1   3532986-3533302      * | 0.00621673
      5                 chr1   3556829-3557065      * | 0.00428748
    ...                  ...               ...    ... .        ...
  52768                 chrY   5364048-5364360      * | 0.00425416
  52769                 chrY   9769327-9769664      * | 0.00823932
  52770                 chrY   9820370-9820976      * | 0.01097052
  52771                 chrY 10165336-10166194      * | 0.00799716
  52772 chrY_JH584303_random       31923-32397      * | 0.00740656

Tet1.count 
2 Samples, 48923 sites in matrix:
                ID Tissue Factor Condition Replicate    Reads FRiP
1 Tet1_wt_24h_rep1     wt wt_24h       24h         1 32518176 0.09
2 Tet1_wt_24h_rep2     wt wt_24h       24h         2 25186571 0.10

sessionInfo( )
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)



locale:
[1] en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] LSD_4.1-0                   DiffBind_3.2.5              SummarizedExperiment_1.22.0
 [4] Biobase_2.52.0              MatrixGenerics_1.4.0        matrixStats_0.58.0         
 [7] GenomicRanges_1.44.0        GenomeInfoDb_1.28.0         IRanges_2.26.0             
[10] S4Vectors_0.30.0            BiocGenerics_0.38.0        

loaded via a namespace (and not attached):
  [1] GOstats_2.58.0           backports_1.2.1          BiocFileCache_2.0.0      plyr_1.8.6              
  [5] GSEABase_1.54.0          splines_4.1.0            BiocParallel_1.26.0      ggplot2_3.3.5           
  [9] amap_0.8-18              digest_0.6.27            invgamma_1.1             htmltools_0.5.1.1       
 [13] GO.db_3.13.0             SQUAREM_2021.1           fansi_0.5.0              magrittr_2.0.1          
 [17] checkmate_2.0.0          memoise_2.0.0            BSgenome_1.60.0          base64url_1.4           
 [21] limma_3.48.0             Biostrings_2.60.0        annotate_1.70.0          systemPipeR_1.26.3      
 [25] bdsmatrix_1.3-4          prettyunits_1.1.1        jpeg_0.1-8.1             colorspace_2.0-1        
 [29] blob_1.2.1               rappdirs_0.3.3           apeglm_1.14.0            ggrepel_0.9.1           
 [33] xfun_0.23                dplyr_1.0.7              crayon_1.4.1             RCurl_1.98-1.3          
 [37] jsonlite_1.7.2           graph_1.70.0             genefilter_1.74.0        brew_1.0-6              
 [41] survival_3.2-11          VariantAnnotation_1.38.0 glue_1.4.2               gtable_0.3.0            
 [45] zlibbioc_1.38.0          XVector_0.32.0           DelayedArray_0.18.0      V8_3.4.2                
 [49] Rgraphviz_2.36.0         scales_1.1.1             pheatmap_1.0.12          mvtnorm_1.1-1           
 [53] DBI_1.1.1                edgeR_3.34.0             Rcpp_1.0.6               xtable_1.8-4            
 [57] progress_1.2.2           emdbook_1.3.12           bit_4.0.4                rsvg_2.1.2              
 [61] truncnorm_1.0-8          AnnotationForge_1.34.0   httr_1.4.2               gplots_3.1.1            
 [65] RColorBrewer_1.1-2       ellipsis_0.3.2           pkgconfig_2.0.3          XML_3.99-0.6            
 [69] sass_0.4.0               dbplyr_2.1.1             locfit_1.5-9.4           utf8_1.2.1              
 [73] tidyselect_1.1.1         rlang_0.4.11             AnnotationDbi_1.54.0     munsell_0.5.0           
 [77] tools_4.1.0              cachem_1.0.5             generics_0.1.0           RSQLite_2.2.7           
 [81] evaluate_0.14            stringr_1.4.0            fastmap_1.1.0            yaml_2.2.1              
 [85] knitr_1.33               bit64_4.0.5              caTools_1.18.2           purrr_0.3.4             
 [89] KEGGREST_1.32.0          RBGL_1.68.0              biomaRt_2.48.0           compiler_4.1.0          
 [93] rstudioapi_0.13          filelock_1.0.2           curl_4.3.1               png_0.1-7               
 [97] geneplotter_1.70.0       tibble_3.1.2             bslib_0.2.5.1            stringi_1.6.2           
[101] GenomicFeatures_1.44.0   lattice_0.20-44          Matrix_1.3-3             vctrs_0.3.8             
[105] pillar_1.6.1             lifecycle_1.0.0          BiocManager_1.30.15      jquerylib_0.1.4         
[109] irlba_2.3.3              data.table_1.14.0        bitops_1.0-7             rtracklayer_1.52.0      
[113] R6_2.5.0                 BiocIO_1.2.0             latticeExtra_0.6-29      hwriter_1.3.2           
[117] ShortRead_1.50.0         KernSmooth_2.23-20       MASS_7.3-54              gtools_3.8.2            
[121] assertthat_0.2.1         DESeq2_1.32.0            Category_2.58.0          rjson_0.2.20            
[125] withr_2.4.2              GenomicAlignments_1.28.0 batchtools_0.9.15        Rsamtools_2.8.0         
[129] GenomeInfoDbData_1.2.6   hms_1.1.0                grid_4.1.0               DOT_0.1                 
[133] coda_0.19-4              rmarkdown_2.8            GreyListChIP_1.24.0      ashr_2.2-47             
[137] mixsqp_0.3-43            bbmle_1.0.23.1           numDeriv_2016.8-1.1      restfulr_0.0.13

DiffBind • 893 views

ADD COMMENT • link updated 3.3 years ago by Rory Stark ★ 5.2k • written 3.3 years ago by faiza • 0

score 0 · Answer 1 · 2021-08-13

0

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 5 weeks ago

Cambridge, UK

Try calling dba.peakset(Tet1.count, bRetrieve=TRUE) (after calling dba.count()) to get the GRanges object you are interested in.

Assuming Tet1consensusPeaks and consensusPeaks are the same object in your example, I suspect the reason you are seeing fewer peaks after counting reads is because of the value for summits, which causes peaks to be re-centered and made a consistent width (in your case 401bp each). you are showing that some of the peaks in consensusPeaks are narrower than 401bp, so after they are expanded to 401bp, some may then overlap and get merged, resulting in fewer (merged) peaks. You may consider a lower value for summits, such as summits=100.

ADD COMMENT • link 3.3 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Dear Rory, thanks for the response. Following your suggestion, i can now get the count of both replicates.

However, the length issue stays; Tet1consensusCount is still not the same as consensusPeaks even if i change the summit=50, or summit= FALSE.

Tet1.count <- dba.count(Tet1.BL, peaks=Tet1consensusPeaks, summits=FALSE,score=DBA_SCORE_NORMALIZED, bUseSummarizeOverlaps=TRUE) Tet1.count consensusCount <- dba.peakset(Tet1.count, bRetrieve=TRUE) GRanges object with 48229 ranges and 2 metadata columns: seqnames ranges strand | Tet1_wt_24h_rep1 Tet1_wt_24h_rep2 <Rle> <IRanges> <Rle> | <numeric> <numeric> 1 chr1 3264111-3264373 | 9.75996 22.9109 2 chr1 3531530-3532044 | 21.29446 53.8407 3 chr1 3532986-3533302 | 14.19631 18.3287 4 chr1 3556829-3557065 | 9.75996 12.6010 5 chr1 3593397-3593597 * | 3.54908 12.6010

Question now is, 1: Since its an occupancy analysis, I need these GRnages to overlap with other TF ChIPseq data sets. Should I use GRnages of "consensusPeaks" or GRnages of consensusCount. The later gives me read count of each replicate which is meaningful in some downstream analysis. 2: I may have totally misunderstood but I am confused whether one should use summit or not at all? what is the benefit beside it gives the GRanges of the same length?
3: why setting summit= FALSE leads to a GRanges object with length of 48229, should not it be the same length as "consensusPeaks" of length 52772?

Thank you in advance for help/suggestions.

ADD REPLY • link 3.3 years ago faiza • 0

0

Entering edit mode

I'm still not clear on where exactly Tet1consensusPeaks comes from in your original code. Is it the same as consensusPeaks?

This may have something to do with the _random contigs. I'm pretty sure that the 'bRemoveRandom' parameter is ignored when retrieving the consensus peakset, so it could be that the difference between 52772 and 48923 sites are the ones on chrM and/or _random contigs. If you sent me access to your Tet1.BL object (and possible Tet1consensusPeaks if it was derived from something other than Tet1.BL), I could have a look to see what sites are being merged/removed.

Regarding summits, besides making the peak widths more uniform, centering on the point of maximal consensus enrichment increases the promotion of the bases included in the analysis that represent enriched areas. Including excess background signal, as is more predominant at the outskirts of the intervals, lessens the accuracy of both the normalization and analysis steps.

ADD REPLY • link 3.3 years ago Rory Stark ★ 5.2k