Normalization in DiffBind for Atac-seq
1
2
Entering edit mode
Debayan ▴ 20
@debayan-24308
Last seen 13 months ago
Spain

Hi, I have been using DiffBind_3.0.7 for determining Differential accessible sites. I have CT and KD samples, 3 replicates each, and we expect a large scale opening of the chromatin in the KD. After calling Peaks with MACS2 and running the DiffBind steps, I used 3 different Normalization commands to see how or whether they differ significantly in terms of numbers of differentially open sites . I read the Normalization part of your Vignette, and it mentions that one of the recommendations for atac-seq is actually the second option I use. I was wondering if the first option would be a valid way. Also, a full library size normalization ( option 3 ) should be similar to option 2, but I get much more sites. Would you recommend I stick with Option 2 ? I am attaching the MA plots ( Options 1,2 and 3 )

Also, if in a situation where I do no expect widespread changes of chromatin accessibility, would you recommend RLE on Reads in Peaks normalization?

Thank you, Debayan

Option 1 Option 2 Option 3 Option 2 UnNormalized

Code should be placed in three backticks as shown below


mod_atacseq_data.norm1 <- dba.normalize(mod_atacseq_data, method=DBA_DESEQ2, normalize=DBA_NORM_LIB, library=DBA_LIBSIZE_BACKGROUND, background=TRUE)
mod_atacseq_data.norm2 <- dba.normalize(mod_atacseq_data, method=DBA_DESEQ2, normalize=DBA_NORM_NATIVE, library=DBA_LIBSIZE_BACKGROUND, background=TRUE)
mod_atacseq_data.norm3 <- dba.normalize(mod_atacseq_data, method=DBA_DESEQ2, normalize=DBA_NORM_LIB, library=DBA_LIBSIZE_FULL, background=FALSE)

#Contrasts#
mod_atacseq_data.norm1.ctst <- dba.contrast(mod_atacseq_data.norm1, minMembers=3, categories=DBA_CONDITION)
mod_atacseq_data.norm2.ctst <- dba.contrast(mod_atacseq_data.norm2, minMembers=3, categories=DBA_CONDITION)
mod_atacseq_data.norm3.ctst <- dba.contrast(mod_atacseq_data.norm3, minMembers=3, categories=DBA_CONDITION)

#Differential Analysis#
mod_atacseq_data.results1 <- dba.analyze(mod_atacseq_data.norm1.ctst, method=DBA_DESEQ2)
mod_atacseq_data.results2 <- dba.analyze(mod_atacseq_data.norm2.ctst, method=DBA_DESEQ2)
mod_atacseq_data.results3 <- dba.analyze(mod_atacseq_data.norm3.ctst, method=DBA_DESEQ2)

mod_atacseq_data.DB1 <- dba.report(mod_atacseq_data.results1, method=DBA_DESEQ2, bCalled=T, bCounts=T)
mod_atacseq_data.DB2 <- dba.report(mod_atacseq_data.results2, method=DBA_DESEQ2, bCalled=T, bCounts=T)
mod_atacseq_data.DB3 <- dba.report(mod_atacseq_data.results3, method=DBA_DESEQ2, bCalled=T, bCounts=T)

bindingdata1=as.data.frame(mod_atacseq_data.DB1)
bindingdata2=as.data.frame(mod_atacseq_data.DB2)
bindingdata3=as.data.frame(mod_atacseq_data.DB3)


 sum(bindingdata1$Fold>0)
4882
> sum(bindingdata1$Fold<0)
54

> sum(bindingdata2$Fold>0)
4806
> sum(bindingdata2$Fold<0)
42

> sum(bindingdata3$Fold>0)
6543
> sum(bindingdata3$Fold<0)
63

dba.plotMA(mod_atacseq_data.results1)

dba.plotMA(mod_atacseq_data.results2)

dba.plotMA(mod_atacseq_data.results3)




sessionInfo( )
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] csaw_1.24.3                 DiffBind_3.0.7              SummarizedExperiment_1.20.0
 [4] MatrixGenerics_1.2.0        matrixStats_0.57.0          Hmisc_4.4-2                
 [7] Formula_1.2-4               survival_3.1-12             lattice_0.20-41            
[10] impute_1.64.0               EnsDb.Hsapiens.v79_2.99.0   ensembldb_2.14.0           
[13] AnnotationFilter_1.14.0     GenomicFeatures_1.42.1      AnnotationDbi_1.52.0       
[16] Biobase_2.50.0              GenomicRanges_1.42.0        GenomeInfoDb_1.26.1        
[19] IRanges_2.24.0              S4Vectors_0.28.0            BiocGenerics_0.36.0        
[22] forcats_0.5.0               purrr_0.3.4                 readr_1.4.0                
[25] tidyr_1.1.2                 tidyverse_1.3.0             data.table_1.13.2          
[28] stringr_1.4.0               dplyr_1.0.2                 ggfortify_0.4.11           
[31] ggplot2_3.3.2               tibble_3.0.4                RColorBrewer_1.1-2         
[34] pheatmap_1.0.12             readxl_1.3.1               

loaded via a namespace (and not attached):
  [1] tidyselect_1.1.0         RSQLite_2.2.1            htmlwidgets_1.5.2        grid_4.0.2              
  [5] BiocParallel_1.24.1      munsell_0.5.0            base64url_1.4            systemPipeR_1.24.2      
  [9] withr_2.3.0              colorspace_2.0-0         Category_2.56.0          knitr_1.30              
 [13] rstudioapi_0.13          bbmle_1.0.23.1           GenomeInfoDbData_1.2.4   mixsqp_0.3-43           
 [17] hwriter_1.3.2            bit64_4.0.5              coda_0.19-4              batchtools_0.9.14       
 [21] vctrs_0.3.5              generics_0.1.0           xfun_0.19                BiocFileCache_1.14.0    
 [25] R6_2.5.0                 apeglm_1.12.0            invgamma_1.1             locfit_1.5-9.4          
 [29] rsvg_2.1                 bitops_1.0-6             DelayedArray_0.16.0      assertthat_0.2.1        
 [33] scales_1.1.1             nnet_7.3-14              gtable_0.3.0             rlang_0.4.9             
 [37] genefilter_1.72.0        splines_4.0.2            rtracklayer_1.50.0       lazyeval_0.2.2          
 [41] broom_0.7.2              brew_1.0-6               checkmate_2.0.0          BiocManager_1.30.10     
 [45] yaml_2.2.1               modelr_0.1.8             backports_1.2.0          RBGL_1.66.0             
 [49] tools_4.0.2              ellipsis_0.3.1           gplots_3.1.1             Rcpp_1.0.5              
 [53] plyr_1.8.6               base64enc_0.1-3          progress_1.2.2           zlibbioc_1.36.0         
 [57] RCurl_1.98-1.2           prettyunits_1.1.1        rpart_4.1-15             openssl_1.4.3           
 [61] ashr_2.2-47              haven_2.3.1              ggrepel_0.8.2            cluster_2.1.0           
 [65] fs_1.5.0                 magrittr_2.0.1           reprex_0.3.0             truncnorm_1.0-8         
 [69] mvtnorm_1.1-1            SQUAREM_2020.5           amap_0.8-18              ProtGenerics_1.22.0     
 [73] hms_0.5.3                xtable_1.8-4             XML_3.99-0.5             emdbook_1.3.12          
 [77] jpeg_0.1-8.1             gridExtra_2.3            compiler_4.0.2           biomaRt_2.46.0          
 [81] bdsmatrix_1.3-4          KernSmooth_2.23-17       V8_3.4.0                 crayon_1.3.4            
 [85] htmltools_0.5.0          GOstats_2.56.0           geneplotter_1.68.0       lubridate_1.7.9.2       
 [89] DBI_1.1.0                dbplyr_2.0.0             MASS_7.3-51.6            rappdirs_0.3.1          
 [93] ShortRead_1.48.0         Matrix_1.2-18            cli_2.2.0                pkgconfig_2.0.3         
 [97] GenomicAlignments_1.26.0 numDeriv_2016.8-1.1      foreign_0.8-80           xml2_1.3.2              
[101] annotate_1.68.0          XVector_0.30.0           AnnotationForge_1.32.0   rvest_0.3.6             
[105] VariantAnnotation_1.36.0 digest_0.6.27            graph_1.68.0             Biostrings_2.58.0       
[109] cellranger_1.1.0         htmlTable_2.1.0          edgeR_3.32.0             GSEABase_1.52.0         
[113] GreyListChIP_1.22.0      curl_4.3                 Rsamtools_2.6.0          gtools_3.8.2            
[117] rjson_0.2.20             lifecycle_0.2.0          jsonlite_1.7.1           askpass_1.1             
[121] limma_3.46.0             BSgenome_1.58.0          fansi_0.4.1              pillar_1.4.7            
[125] httr_1.4.2               GO.db_3.12.1             glue_1.4.2               png_0.1-7               
[129] bit_4.0.4                Rgraphviz_2.34.0         stringi_1.5.3            blob_1.2.1              
[133] DESeq2_1.30.0            latticeExtra_0.6-29      caTools_1.18.0           memoise_1.1.0           
[137] DOT_0.1                  irlba_2.3.3
DiffBind • 2.5k views
ADD COMMENT
0
Entering edit mode

Please show the MA-plots, you can use the image button to upload images (the one right of the Code button).

ADD REPLY
0
Entering edit mode

Sorry, I have just uploaded them. Thanks!

ADD REPLY
0
Entering edit mode

Just my two cents: Basically these plots look all the same. I would even argue that they are all offset towards the LX condition which is probably more of a normalization artifact than true biology. The data look kind of symmetric like in any other ATAC-seq experiment I've seen, supported by the local regression trend, and the fact that the bulk of the very high-count regions (far right) is also offset towards LX makes it even more questionable whether this is really true biology in terms of a general shift. I would strongly recommend to explore alternative normalization strategies and also check for confounders, e.g. FRiPs so whether data quality is notable different between data. See e.g. the csaw vignette for suggestions. I am not a DiffBind user so not sure what exactly the above strategies do, therefore suggesting csaw to explore data (edit:) as an independent approach.

ADD REPLY
0
Entering edit mode

Also, it would be good to see the non-normalized MA plot!

ADD REPLY
0
Entering edit mode

Thanks a lot for your suggestion, I will try csaw as an independent approach. I have attached the non-normalized MA plot. There does appear to be deviation towards the y=0 line upon normalization, but as you mention there is a definite offset towards LX in the normalized MA plot

ADD REPLY
2
Entering edit mode
Rory Stark ★ 5.1k
@rory-stark-5741
Last seen 6 days ago
Cambridge, UK

Interesting questions!

I'll start with the last question:

Also, if in a situation where I do no expect widespread changes of chromatin accessibility, would you recommend RLE on Reads in Peaks normalization?

The issue is less how widespread the changes are, rather how balanced they are. In the case you are showing, there appears to be a consistent shift to more open chromatin regions in the LX condition, so you would definitely want to avoid using Reads-in-Peaks! at this point, I think you would need to have a good rationale for using RiP to normalize this type of data, regardless of your expectations. I'm finding it more and more difficult to rationalize ever using RiP normalization for the types of experimental data people use DiffBind to analyze.

That said, I think you've done well with the second option (RLE normalization on background windows, assuming you are using DESeq2 as the analysis method). In this case I'm not sure that "more is better" in terms of identifying differential open chromatin.You may want to look at some of the additional differential sites in the full library size case (ie in a genome browser) to see if they are clearly different so you can determine if you want to include them in your downstream analysis.

The full-library-size based normalization is really a proxy for a background-based one which should give better results, especially since a) you can use RLE to perform a bit more subtle calculation of normalization factors and b) the RLE and Library methods appear to "agree" well when using the background windows.

ADD COMMENT
0
Entering edit mode

Hi Rory, Thank you very much for your insights.

ADD REPLY

Login before adding your answer.

Traffic: 1033 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6