DiffBind returns different grey lists every time I run dba.blacklist even if I set seed.
Entering edit mode
r.murakami • 0
Last seen 26 days ago


I ran DiffBind dba.blacklist function to get greylist. Then I noticed that DiffBind returns different grey lists every time I ran dba.blacklist even if I set seed as follows (set.seed(42)).

If there is a hint to solve this problem, please let me know. Thank you.

>ATAC = dba(sampleSheet="path_to_samplesheet")
1 A ER Primary X 1 bed
2 A ER Primary X 2 bed
3 B ER Primary X 1 bed
4 B ER Primary X 2 bed
>df <- dba.blacklist(ATAC)
Genome detected: Mmusculus.UCSC.mm10
Applying blacklist...
Removed: 4199 of 112981 intervals.
Counting control reads for greylist...
Building greylist: path_to_bam1
**coverage: 4371607 bp (0.16%)**
Building greylist: path_to_bam2
coverage: 37162477 bp (1.36%)
A_Input: 1221 ranges, 4371607 bases
B_Input: 7776 ranges, 37162477 bases
**Master greylist: 8028 ranges, 38776471 bases**
Removed: 52849 of 108782 intervals.
Removed: 16867 merged (of 45345) and 13500 (of 28363) consensus.
>df <- dba.blacklist(ATAC)
Genome detected: Mmusculus.UCSC.mm10
Applying blacklist...
Removed: 4199 of 112981 intervals.
Counting control reads for greylist...
Building greylist: path_to_bam1
**coverage: 28212375 bp (1.03%)**
Building greylist: path_to_bam2
coverage: 37162477 bp (1.36%)
A_Input: 6688 ranges, 28212375 bases
B_Input: 7776 ranges, 37162477 bases
**Master greylist: 10857 ranges, 53424791 bases**
Removed: 58656 of 108782 intervals.
Removed: 19151 merged (of 45345) and 15094 (of 28363) consensus.

sessionInfo( )
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /usr/local/package/r/4.1.0/lib64/R/lib/libRblas.so
LAPACK: /usr/local/package/r/4.1.0/lib64/R/lib/libRlapack.so

 [1] LC_CTYPE=ja_JP.UTF-8       LC_NUMERIC=C               LC_TIME=ja_JP.UTF-8       
 [7] LC_PAPER=ja_JP.UTF-8       LC_NAME=C                  LC_ADDRESS=C              

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sessioninfo_1.2.2           data.table_1.14.8           readr_2.1.4                
 [4] BiocParallel_1.28.3         DiffBind_3.4.11             SummarizedExperiment_1.24.0
 [7] Biobase_2.54.0              MatrixGenerics_1.6.0        matrixStats_1.0.0          
[10] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1         IRanges_2.28.0             
[13] S4Vectors_0.32.4            BiocGenerics_0.40.0        

loaded via a namespace (and not attached):
  [1] bitops_1.0-7             bit64_4.0.5              httr_1.4.7               RColorBrewer_1.1-3      
  [5] numDeriv_2016.8-1.1      tools_4.1.0              utf8_1.2.3               R6_2.5.1                
  [9] irlba_2.3.5.1            KernSmooth_2.23-20       DBI_1.1.3                colorspace_2.1-0        
 [13] apeglm_1.16.0            DESeq2_1.34.0            tidyselect_1.2.0         gridExtra_2.3           
 [17] bit_4.0.5                compiler_4.1.0           cli_3.6.1                DelayedArray_0.20.0     
 [21] rtracklayer_1.54.0       caTools_1.18.2           scales_1.2.1             SQUAREM_2021.1          
 [25] mvtnorm_1.2-3            genefilter_1.76.0        mixsqp_0.3-48            stringr_1.5.0           
 [29] digest_0.6.33            Rsamtools_2.10.0         XVector_0.34.0           jpeg_0.1-10             
 [33] pkgconfig_2.0.3          htmltools_0.5.6          fastmap_1.1.1            invgamma_1.1            
 [37] bbmle_1.0.25             limma_3.50.3             BSgenome_1.62.0          htmlwidgets_1.6.2       
 [41] rlang_1.1.1              RSQLite_2.3.1            rstudioapi_0.15.0        BiocIO_1.4.0            
 [45] generics_0.1.3           hwriter_1.3.2.1          gtools_3.9.4             dplyr_1.1.2             
 [49] RCurl_1.98-1.12          magrittr_2.0.3           GenomeInfoDbData_1.2.7   interp_1.1-4            
 [53] Matrix_1.6-1             Rcpp_1.0.11              munsell_0.5.0            fansi_1.0.4             
 [57] lifecycle_1.0.3          edgeR_3.36.0             stringi_1.7.12           yaml_2.3.7              
 [61] MASS_7.3-60              zlibbioc_1.40.0          gplots_3.1.3             plyr_1.8.8              
 [65] blob_1.2.4               grid_4.1.0               parallel_4.1.0           ggrepel_0.9.3           
 [69] bdsmatrix_1.3-6          crayon_1.5.2             deldir_1.0-9             lattice_0.21-8          
 [73] splines_4.1.0            Biostrings_2.62.0        annotate_1.72.0          KEGGREST_1.34.0         
 [77] hms_1.1.3                locfit_1.5-9.8           pillar_1.9.0             rjson_0.2.21            
 [81] systemPipeR_2.0.8        geneplotter_1.72.0       XML_3.99-0.14            glue_1.6.2              
 [85] ShortRead_1.52.0         GreyListChIP_1.26.0      latticeExtra_0.6-30      tzdb_0.4.0              
 [89] png_0.1-8                vctrs_0.6.3              gtable_0.3.4             amap_0.8-19             
 [93] cachem_1.0.8             ashr_2.2-63              ggplot2_3.4.3            emdbook_1.3.13          
 [97] xtable_1.8-4             restfulr_0.0.15          coda_0.19-4              survival_3.5-7          
[101] truncnorm_1.0-9          tibble_3.2.1             memoise_2.0.1            AnnotationDbi_1.56.2    
[105] GenomicAlignments_1.30.0
DiffBind • 976 views
Entering edit mode
Last seen 1 day ago
United States

Using .set.seed only sets the seed for the first call to any random number generation, after which the seed is incremented. As an example

> set.seed(42)
> z <- rnorm(10)
> set.seed(42)
> zz <- rnorm(10)
> all.equal(z, zz)
[1] TRUE
> set.seed(42)
> all.equal(rnorm(10), rnorm(10))
[1] "Mean relative difference: 2.392926"

So setting the seed one time and then running dba.blacklist multiple times will result in different seeds being used for each run.

Entering edit mode

Dear James,

Got it. Thank you for your information.

Entering edit mode
Rory Stark ★ 5.1k
Last seen 8 weeks ago
Cambridge, UK

Regardless of the random seed, it is concerning to see such variance in the generated greylist for path_to_bam1. I'll log this issue but I'm not sure when it will get looked at.

Entering edit mode

Hi @r.murakani,

As Rory said, we would not expect to see this kind of variance in the generated grey list and, having looked at some of our ChIP input samples, I see much more consistency in the regions that are grey-listed and the count thresholds that determine these.

DiffBind uses the GreyListChIP package to generate the grey list for your ChIP input samples. GreyListChIP does so by counting reads within overlapping 1024bp tiles across the genome. It then samples what is rather large number of counts several times (100 repetitions of sample size 30000) and fits a negative binomial distribution to each sample. It uses the average mu and size parameters from the fitted distributions and computes a threshold count value with a specified p-value (DiffBind uses p = 0.999).

I suspect that your distribution of counts looks quite different to what we normally see for a ChIP input sample, perhaps with the majority of your reads concentrated in a few regions of the genome. You can generate these counts directly using GreyListChIP and plot the distribution as follows:


gr <- as(seqinfo(BamFile("my_input.bam")), "GRanges")

gl <- new("GreyList", regions = gr)
gl <- countReads(gl, bam_file)

counts <- data.frame(count = gl@counts)

write.csv(counts, file = "counts.csv", row.names = FALSE, quote = FALSE)

ggplot(counts, aes(x = count)) + geom_density() + xlim(c(0, 100))

I'd be happy to take a look if you want to send me the counts.csv file.


Login before adding your answer.

Traffic: 613 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6