Question

How sensitive is DiffBind to Deseq2 assumptions ?

0

Entering edit mode

Sam ▴ 10

@sam-21502

Last seen 2 days ago

Jerusalem

I thought it might be worthwhile to check how well the DiffBind analysis using DESeq2 fits the assumptions needed. The data comes from ChIP-Seq. I performed a DESeq2 analysis, using DiffBind fetched a DESeq2 object from its results, and drew the dispersion plot.

enter image description here

The fitted function has a "hump" meaning that the dispersion becomes higher with the mean for some portion of the graph.

I think that in an RNA-Seq one should be worried if such a graph would be produced. How worrisome is it in ChIP-Seq, in diffbind?

Also, since I cannot assume balanced changes in binding between the various conditions, I have normalized by the total library size - not by DESeq2 native normalization. So unsurprisingly, the distribution is different for the different samples. I think there is no other option - but are these invalidated assumptions that ruin the analysis?

enter image description here


dbObj_b_contrast_qc <- dba.contrast(dbObj_b_norm,design="~Condition")
dbObj_b_res_qc <- dba.analyze(dbObj_b_contrast_qc)
dds_qc <- dba.analyze(dbObj_b_res_qc, bRetrieveAnalysis = T)
plotDispEsts(dds_qc)
normc <- as.data.frame(counts(dds_qc, normalized=T))
myBoxplot(normc) # a function plotting boxplots with ggplot2

print(dds_qc$Condition)
sessionInfo()

Output:

[1] WT         WT         WT         Mut        Mut        Mut       
[7] Rescue     Rescue     Rescue_Mut Rescue_Mut Rescue_Mut
Levels: WT Mut Rescue Rescue_Mut



R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_IL.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_IL.UTF-8        LC_COLLATE=en_IL.UTF-8    
 [5] LC_MONETARY=en_IL.UTF-8    LC_MESSAGES=en_IL.UTF-8   
 [7] LC_PAPER=en_IL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_IL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] limma_3.46.0                rgl_0.105.13               
 [3] DiffBind_3.0.13             DESeq2_1.30.1              
 [5] SummarizedExperiment_1.20.0 Biobase_2.50.0             
 [7] MatrixGenerics_1.2.1        matrixStats_0.58.0         
 [9] GenomicRanges_1.42.0        GenomeInfoDb_1.26.2        
[11] IRanges_2.24.1              S4Vectors_0.28.1           
[13] BiocGenerics_0.36.0         RColorBrewer_1.1-2         
[15] pheatmap_1.0.12             ggrepel_0.9.1              
[17] ggplot2_3.3.3              

loaded via a namespace (and not attached):
  [1] backports_1.2.1          GOstats_2.56.0          
  [3] BiocFileCache_1.14.0     plyr_1.8.6              
  [5] GSEABase_1.52.1          splines_4.0.4           
  [7] crosstalk_1.1.1          BiocParallel_1.24.1     
  [9] amap_0.8-18              digest_0.6.27           
 [11] htmltools_0.5.1.1        invgamma_1.1            
 [13] GO.db_3.12.1             SQUAREM_2021.1          
 [15] fansi_0.4.2              magrittr_2.0.1          
 [17] checkmate_2.0.0          memoise_2.0.0           
 [19] BSgenome_1.58.0          base64url_1.4           
 [21] Biostrings_2.58.0        annotate_1.68.0         
 [23] systemPipeR_1.24.3       askpass_1.1             
 [25] bdsmatrix_1.3-4          prettyunits_1.1.1       
 [27] jpeg_0.1-8.1             colorspace_2.0-0        
 [29] blob_1.2.1               rappdirs_0.3.3          
 [31] apeglm_1.12.0            xfun_0.21               
 [33] dplyr_1.0.4              crayon_1.4.1            
 [35] RCurl_1.98-1.2           jsonlite_1.7.2          
 [37] graph_1.68.0             genefilter_1.72.1       
 [39] brew_1.0-6               survival_3.2-7          
 [41] VariantAnnotation_1.36.0 glue_1.4.2              
 [43] gtable_0.3.0             zlibbioc_1.36.0         
 [45] XVector_0.30.0           webshot_0.5.2           
 [47] DelayedArray_0.16.1      V8_3.4.0                
 [49] Rgraphviz_2.34.0         scales_1.1.1            
 [51] mvtnorm_1.1-1            DBI_1.1.1               
 [53] edgeR_3.32.1             miniUI_0.1.1.1          
 [55] Rcpp_1.0.6               xtable_1.8-4            
 [57] progress_1.2.2           emdbook_1.3.12          
 [59] bit_4.0.4                rsvg_2.1                
 [61] AnnotationForge_1.32.0   truncnorm_1.0-8         
 [63] htmlwidgets_1.5.3        httr_1.4.2              
 [65] gplots_3.1.1             ellipsis_0.3.1          
 [67] farver_2.0.3             pkgconfig_2.0.3         
 [69] XML_3.99-0.5             sass_0.3.1              
 [71] dbplyr_2.1.0             locfit_1.5-9.4          
 [73] utf8_1.1.4               labeling_0.4.2          
 [75] manipulateWidget_0.10.1  later_1.1.0.1           
 [77] tidyselect_1.1.0         rlang_0.4.10            
 [79] AnnotationDbi_1.52.0     munsell_0.5.0           
 [81] tools_4.0.4              cachem_1.0.4            
 [83] cli_2.3.0                generics_0.1.0          
 [85] RSQLite_2.2.3            evaluate_0.14           
 [87] stringr_1.4.0            fastmap_1.1.0           
 [89] yaml_2.2.1               knitr_1.31              
 [91] bit64_4.0.5              caTools_1.18.1          
 [93] purrr_0.3.4              RBGL_1.66.0             
 [95] mime_0.10                xml2_1.3.2              
 [97] biomaRt_2.46.3           compiler_4.0.4          
 [99] rstudioapi_0.13          curl_4.3                
[101] png_0.1-7                geneplotter_1.68.0      
[103] tibble_3.0.6             bslib_0.2.4             
[105] stringi_1.5.3            GenomicFeatures_1.42.1  
[107] lattice_0.20-41          Matrix_1.3-2            
[109] vctrs_0.3.6              pillar_1.5.0            
[111] lifecycle_1.0.0          jquerylib_0.1.3         
[113] irlba_2.3.3              data.table_1.14.0       
[115] bitops_1.0-6             httpuv_1.5.5            
[117] rtracklayer_1.50.0       R6_2.5.0                
[119] latticeExtra_0.6-29      hwriter_1.3.2           
[121] promises_1.2.0.1         ShortRead_1.48.0        
[123] KernSmooth_2.23-18       MASS_7.3-53.1           
[125] gtools_3.8.2             assertthat_0.2.1        
[127] openssl_1.4.3            Category_2.56.0         
[129] rjson_0.2.20             withr_2.4.1             
[131] GenomicAlignments_1.26.0 batchtools_0.9.15       
[133] Rsamtools_2.6.0          GenomeInfoDbData_1.2.4  
[135] hms_1.0.0                DOT_0.1                 
[137] coda_0.19-4              rmarkdown_2.7           
[139] GreyListChIP_1.22.0      ashr_2.2-47             
[141] mixsqp_0.3-43            bbmle_1.0.23.1          
[143] shiny_1.6.0              numDeriv_2016.8-1.1     
[145] tinytex_0.29

DESeq2 DiffBind • 983 views

ADD COMMENT • link updated 3.1 years ago by Rory Stark ★ 5.2k • written 3.1 years ago by Sam ▴ 10

score 0 · Answer 1 · 2021-03-21

0

Entering edit mode

ATpoint ★ 4.1k

@atpoint-13662

Last seen 12 hours ago

Germany

Just my two cents (not being DiffBind/DESeq2 developer): If your libraries are expected to be very different it is especially important to NOT use just plain per-million but something that attempts to correct for composition. I would inspect pairwise MA-plots between samples to see how the data look. Usually MAs have an arrowhead-like shape and the point on the very far are the regions with very large counts which often do not (or only very modestly) change, therefore they (so the "head" of the arrowhead) should be somewhat centered along y=0. You can then explore normalization techniques followed by MA-plots again and visual inspection which one centers these regions best at y=0. It might be a good idea to subset the data to only like the 25% regions with the highest counts (again = the regions farright on the MA) to calculate the size factors, and then feed them manually into your DESeq2 object. Not sure how this goes in DiffBind. I found this useful for some of my datasets (using DESeq2 directly) where composition was kind of impalanced between up- and downregulated regions, and using the full dataset for normalization did not properly center the aforementioned regions at y=0. As you usually have more points on the left than right of the MA-plots (in my understanding) these point have kind of more "weight" to influence normalization, and if these regions change a lot between samples then it might be necessary to perform the aforementioned subsetting to high-count regions assuming that these do not change / are robust estimators of non-changing regions. tl;dr, hope this makes sense to you.

ADD COMMENT • link 3.1 years ago ATpoint ★ 4.1k

0

Entering edit mode

Thanks. As far as I understand, in CHiP-Seq (unlike RNA-Seq) one can expect major changes between conditions. Whereas expression data has "housekeeping" genes, I think this concept does not apply to protein binding, and one can have large differences resulting from chromatin accessibility. So MA plots (and even their high counts) should not necessarily be on y=0, imho.

Here there is a post which asks "what is the best practice in DiffBind to deal with two data sets that have major changes in chromatin accessibility?". Hopefully the question will be answered there.

ADD REPLY • link 3.1 years ago Sam ▴ 10

0

Entering edit mode

That is basically the same. Normalization in these plots seems to be completely off, the OP should focus (as I explained above) on the "arrowhead" regions, as these probably are the most reliable ones. It seems that OP has lots of regions with increased counts in the 2dpr condition and this is setting off normalization. As said I would normalize based on the very highly high-count regions, or try something like bin-based normalization as suggested in the csaw vignette.

ADD REPLY • link 3.1 years ago ATpoint ★ 4.1k

0

Entering edit mode

This is a good conversation. Normalization of ChIP data is not obvious! While DiffBind has several built-in options for normalization options, the best one for your data may require computing the normalization factors outside of the package and then feeding them back in. Other options that would need to be considered upfront is the use of spike-ins, or ChIPing for parallel factors (such as H3K4me3) that work in a similar way to housekeeping genes in RNA-seq

I also recommend reading Aaron Lun's discussion of normalization in the csaw user guide; among other things he talks specifically about trended biases.

ADD REPLY • link 3.1 years ago Rory Stark ★ 5.2k