DiffBind in-depth explanation
jrp208 • 0
Last seen 2.2 years ago


I want to start by saying that I am still relatively new to the intricacies of R packages. I have read through the DiffBind vignette a few times and have tried reading the source code, but I am still trying to understand what some of the plots produced by the package mean, namely those from dba.plotHeatmap() and dba.plotPCA(). I can run the pipeline without issue, but I want to understand what is happening behind the scenes, in particular which features are being used for the clustering.

For example with the heatmaps:

library(DiffBind)
tamoxifen <- dba(sampleSheet="tamoxifen.csv",
                 dir=system.file("extra", package="DiffBind"))

The plot that is produced is: First Heatmap

and when I run this code:

tamoxifen_counts <- dba.count(tamoxifen, summits=250)

The plot that is produced is: Second Heatmap

Which features does each function use for clustering, both for the correlation heatmap and for the PCA? I'm looking for a more in-depth explanation than what is stated in the vignette.

Thank you!



R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.6 (Maipo)

Matrix products: default
BLAS: /usr/local/gcc-6_3_0/lapack/3.7.0/lib/libblas.so.3.7.0
LAPACK: /usr/local/gcc-6_3_0/lapack/3.7.0/lib/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
 [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8     

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] XLConnect_0.2-15            XLConnectJars_0.2-15       
 [3] DiffBind_2.10.0             SummarizedExperiment_1.12.0
 [5] DelayedArray_0.8.0          BiocParallel_1.16.6        
 [7] matrixStats_0.53.1          Biobase_2.40.0             
 [9] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[11] IRanges_2.16.0              S4Vectors_0.20.1           
[13] BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] Category_2.48.1          bitops_1.0-6             bit64_0.9-7             
 [4] RColorBrewer_1.1-2       progress_1.2.0           httr_1.4.0              
 [7] Rgraphviz_2.26.0         backports_1.1.2          tools_3.5.0             
[10] R6_2.2.2                 KernSmooth_2.23-15       DBI_1.0.0               
[13] lazyeval_0.2.1           colorspace_1.3-2         tidyselect_0.2.5        
[16] prettyunits_1.0.2        bit_1.1-13               compiler_3.5.0          
[19] sendmailR_1.2-1          graph_1.60.0             rtracklayer_1.42.2      
[22] checkmate_1.8.5          caTools_1.17.1.2         scales_0.5.0            
[25] BatchJobs_1.8            genefilter_1.64.0        RBGL_1.58.2             
[28] stringr_1.3.1            digest_0.6.15            Rsamtools_1.34.1        
[31] AnnotationForge_1.24.0   XVector_0.22.0           base64enc_0.1-3         
[34] pkgconfig_2.0.1          limma_3.36.1             rlang_0.4.5             
[37] RSQLite_2.1.1            BBmisc_1.11              GOstats_2.48.0          
[40] hwriter_1.3.2            gtools_3.8.1             dplyr_0.8.5             
[43] RCurl_1.95-4.10          magrittr_1.5             GO.db_3.6.0             
[46] GenomeInfoDbData_1.2.0   Matrix_1.2-14            Rcpp_1.0.4              
[49] munsell_0.4.3            stringi_1.2.2            edgeR_3.22.1            
[52] zlibbioc_1.28.0          gplots_3.0.3             plyr_1.8.4              
[55] grid_3.5.0               blob_1.1.1               ggrepel_0.8.2           
[58] gdata_2.18.0             crayon_1.3.4             lattice_0.20-35         
[61] Biostrings_2.50.2        splines_3.5.0            GenomicFeatures_1.34.8  
[64] annotate_1.58.0          hms_0.4.2                locfit_1.5-9.1          
[67] pillar_1.4.3             rjson_0.2.20             systemPipeR_1.16.1      
[70] biomaRt_2.38.0           XML_3.98-1.11            glue_1.3.0              
[73] ShortRead_1.40.0         latticeExtra_0.6-28      data.table_1.11.2       
[76] gtable_0.2.0             purrr_0.2.5              amap_0.8-11             
[79] assertthat_0.2.0         ggplot2_3.1.0            xtable_1.8-2            
[82] survival_2.42-3          tibble_2.1.3             pheatmap_1.0.12         
[85] rJava_0.9-11             GenomicAlignments_1.18.1 AnnotationDbi_1.44.0    
[88] memoise_1.1.0            brew_1.0-6               GSEABase_1.44.0      
DiffBind PCA Differential Analysis • 203 views
Rory Stark ★ 4.4k
Last seen 7 days ago
CRUK, Cambridge, UK

In the first case, the correlation heatmap is based on occupancy data: a sample only has a score for a peak if the peak caller called that peak in that sample. In the second case, count data is used: every sample has a read count score for every peak, regardless of whether it was "called" as a peak in that sample. So regions that had some enrichment in a given sample, but not enough to be called a peak, have positive count scores in the second case but no score in the first, and that sample may therefore correlate more highly with the other samples.
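
To make the distinction concrete, here is a toy sketch in base R. It does not call DiffBind and does not reproduce its internal scoring. The sample names, the read counts, the `call_threshold` cutoff, and the choice to score an uncalled peak as 0 are all assumptions made up for illustration.

```r
# Toy illustration in base R only; none of these numbers come from DiffBind.
# Rows are consensus peaks, columns are two hypothetical samples.
counts <- cbind(
  sampleA = c(120, 95, 60, 40, 15, 10),
  sampleB = c(110, 35, 70, 18, 30,  8)
)

# Assumption for this sketch: a peak counts as "called" in a sample when its
# read count reaches a hypothetical threshold; an uncalled peak is scored 0.
call_threshold <- 50
occupancy <- ifelse(counts >= call_threshold, counts, 0)

# Correlation between the two samples under each scoring scheme.
cor_occupancy <- cor(occupancy[, "sampleA"], occupancy[, "sampleB"])
cor_counts    <- cor(counts[, "sampleA"], counts[, "sampleB"])

print(c(occupancy = cor_occupancy, counts = cor_counts))
```

In this made-up data, the second peak has 35 reads in sampleB, which is below the threshold, so it contributes nothing to sampleB in the occupancy matrix but still contributes to the count matrix. That sub-threshold signal is the kind of information that can shift the sample-to-sample correlation between the two plots.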

This is covered in the Vignette.

