Question

DiffBind in depth explanation

jrp208 • 0

Hello,

I want to start by saying that I am still relatively green when it comes to understanding the intricacies of R packages. I have read through the DiffBind vignette a few times and have tried going through the source code, but I am still trying to understand the meaning of some of the plots the package produces, namely those from dba.plotHeatmap() and dba.plotPCA(). I can run the package's pipeline without issue, but I really want to understand what is happening behind the scenes. Specifically, I want to know which features are being used for the clustering.

For example with the heatmaps:

tamoxifen <- dba(sampleSheet="tamoxifen.csv",
                 dir=system.file("extra", package="DiffBind"))
plot(tamoxifen)

The plot that is produced is: First Heatmap

and when I run this code:

tamoxifen_counts <- dba.count(tamoxifen, summits=250)
plot(tamoxifen_counts)

The plot that is produced is: Second Heatmap

What features does each call use for clustering, both for the correlation heatmap and for the PCA? I'm looking for a more in-depth explanation than what is stated in the vignette.

Thank you!

Joseph

sessionInfo():

sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.6 (Maipo)

Matrix products: default
BLAS: /usr/local/gcc-6_3_0/lapack/3.7.0/lib/libblas.so.3.7.0
LAPACK: /usr/local/gcc-6_3_0/lapack/3.7.0/lib/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
 [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8     
[11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] XLConnect_0.2-15            XLConnectJars_0.2-15       
 [3] DiffBind_2.10.0             SummarizedExperiment_1.12.0
 [5] DelayedArray_0.8.0          BiocParallel_1.16.6        
 [7] matrixStats_0.53.1          Biobase_2.40.0             
 [9] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[11] IRanges_2.16.0              S4Vectors_0.20.1           
[13] BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] Category_2.48.1          bitops_1.0-6             bit64_0.9-7             
 [4] RColorBrewer_1.1-2       progress_1.2.0           httr_1.4.0              
 [7] Rgraphviz_2.26.0         backports_1.1.2          tools_3.5.0             
[10] R6_2.2.2                 KernSmooth_2.23-15       DBI_1.0.0               
[13] lazyeval_0.2.1           colorspace_1.3-2         tidyselect_0.2.5        
[16] prettyunits_1.0.2        bit_1.1-13               compiler_3.5.0          
[19] sendmailR_1.2-1          graph_1.60.0             rtracklayer_1.42.2      
[22] checkmate_1.8.5          caTools_1.17.1.2         scales_0.5.0            
[25] BatchJobs_1.8            genefilter_1.64.0        RBGL_1.58.2             
[28] stringr_1.3.1            digest_0.6.15            Rsamtools_1.34.1        
[31] AnnotationForge_1.24.0   XVector_0.22.0           base64enc_0.1-3         
[34] pkgconfig_2.0.1          limma_3.36.1             rlang_0.4.5             
[37] RSQLite_2.1.1            BBmisc_1.11              GOstats_2.48.0          
[40] hwriter_1.3.2            gtools_3.8.1             dplyr_0.8.5             
[43] RCurl_1.95-4.10          magrittr_1.5             GO.db_3.6.0             
[46] GenomeInfoDbData_1.2.0   Matrix_1.2-14            Rcpp_1.0.4              
[49] munsell_0.4.3            stringi_1.2.2            edgeR_3.22.1            
[52] zlibbioc_1.28.0          gplots_3.0.3             plyr_1.8.4              
[55] grid_3.5.0               blob_1.1.1               ggrepel_0.8.2           
[58] gdata_2.18.0             crayon_1.3.4             lattice_0.20-35         
[61] Biostrings_2.50.2        splines_3.5.0            GenomicFeatures_1.34.8  
[64] annotate_1.58.0          hms_0.4.2                locfit_1.5-9.1          
[67] pillar_1.4.3             rjson_0.2.20             systemPipeR_1.16.1      
[70] biomaRt_2.38.0           XML_3.98-1.11            glue_1.3.0              
[73] ShortRead_1.40.0         latticeExtra_0.6-28      data.table_1.11.2       
[76] gtable_0.2.0             purrr_0.2.5              amap_0.8-11             
[79] assertthat_0.2.0         ggplot2_3.1.0            xtable_1.8-2            
[82] survival_2.42-3          tibble_2.1.3             pheatmap_1.0.12         
[85] rJava_0.9-11             GenomicAlignments_1.18.1 AnnotationDbi_1.44.0    
[88] memoise_1.1.0            brew_1.0-6               GSEABase_1.44.0

DiffBind PCA Differential Analysis • 1.5k views

updated 5.9 years ago by Rory Stark • written 5.9 years ago by jrp208

score 1 · Accepted Answer · 2020-03-19

In the first case, the correlation heatmap is based on occupancy data: a sample has a score for a peak only if the peak caller called that peak in that sample. In the second case, count data is used: every sample has a read count score for every peak, regardless of whether it was "called" as a peak in that sample. So regions that had some enrichment in a given sample, but not enough to be called a peak, will have positive count scores in the second case but not in the first, and that sample may therefore correlate more highly with the other samples.

This is covered in the Vignette.
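
To make the occupancy-versus-count distinction concrete, here is a minimal toy sketch in base R. It does not call DiffBind and is not its internal code: the simulated counts, the per-sample thresholds standing in for a peak caller, and the choice to score uncalled peaks as 0 are all assumptions made for illustration. It only shows how the same samples can produce different correlation matrices depending on which of the two matrices is used.

```r
# Toy illustration in base R only; this is NOT DiffBind's internal code.
set.seed(1)

n_peaks <- 200
samples <- c("A1", "A2", "B1", "B2")  # hypothetical sample names

# Hypothetical read counts: every sample has a value at every consensus peak.
base_signal <- rgamma(n_peaks, shape = 2, scale = 50)
counts <- sapply(samples, function(s) rpois(n_peaks, lambda = base_signal))

# Hypothetical peak calls: a peak counts as "called" in a sample only where
# the count clears a sample-specific threshold (a stand-in for a peak caller).
thresholds <- c(A1 = 60, A2 = 80, B1 = 100, B2 = 120)
called <- sweep(counts, 2, thresholds, FUN = ">=")

# Occupancy-style matrix: a score only where the peak was called.
# Filling uncalled peaks with 0 is an assumption of this sketch.
occupancy <- counts * called

# Sample-by-sample Pearson correlations: the kind of matrix that a
# correlation heatmap clusters.
cor_occupancy <- cor(occupancy)
cor_counts    <- cor(counts)

# PCA on the count matrix: samples are observations, peaks are features.
pca_counts <- prcomp(t(log2(counts + 1)))

print(round(cor_occupancy, 2))
print(round(cor_counts, 2))
```

In both matrices the features are the peaks (rows) and the things being clustered are the samples (columns). What changes between the two cases is only how each peak is scored in each sample.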