Question

Too much data for Diffbind

0

Entering edit mode

littlefishes20 ▴ 10

@littlefishes20-13054

Last seen 3 months ago

Taiwan

Hello,

Thanks for your effort in producing such a good tool for analysis.

Recently, I am using Diffbind to characterize differential open chromosome regions with ATAC-seq. I have more than 50 samples, equipping with a bam file ranging from 10GB-30GB. However, I found it really hard to handle such a huge amount of data for DiffBind. While I set 'bParallel=TRUE' for dba.count function, it was still too slow to process.

Even when I selected four samples for testing, the tool still broke with two different kinds of error.

library("DiffBind")
samples=read.csv("res.info.csv")
tamoxifen <- dba(sampleSheet=samples)
#sample1 bulk B1 treat1 lib 1 bed
#sample2 bulk B1 treat2 lib 1 bed
#sample3 bulk B1 treat3 lib 1 bed
#sample4 bulk B1 treat4 lib 1 bed
tamoxifen
#4 Samples, 135778 sites in matrix (188638 total):
#         ID Tissue Factor    Condition Treatment Replicate Intervals
#1   sample1   bulk     B1   treat1     lib         1    188126
#2 ...
tamoxifen <- dba.count(tamoxifen,bParallel=TRUE)
#Computing summits...
#Re-centering peaks...
#Reads will be counted as Paired-end.
#Reads will be counted as Paired-end.
#Reads will be counted as Paired-end.
#Reads will be counted as Paired-end.
Type1 error:
#Error: Error processing one or more read files. Check warnings().
#In addition: Warning messages:
#1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE, mc.allow.recursive = TRUE) :
#  scheduled core 2 encountered error in user code, all values of the job will be affected
#2:   error in evaluating the argument 'x' in selecting a method for function 'assay': wrong args for environment subassignment
Type2 error:
#Re-centering peaks...
#Sample: ./input.bam125
#Reads will be counted as Paired-end.
#'BiocParallel' did not register default BiocParallelParam:
#  missing value where TRUE/FALSE needed
#Error in h(simpleError(msg, call)) :
#  error in evaluating the argument 'x' in selecting a method for function 'assay': error in evaluating the argument 'BPPARAM' in selecting a method for function 'bplapply': attempt to select less than one element in get1index

I am wondering if there is any suggestion for my case. Many thanks in advance!

Sincerely yours Xiaoyu

DiffBind • 4.1k views

ADD COMMENT • link updated 14 months ago by Ou, Jianhong ★ 1.3k • written 3.1 years ago by littlefishes20 ▴ 10

0

Entering edit mode

I encountered the similar issues, but solved by using:

dba.count(tamoxifen, fragmentSize=0, bUseSummarizeOverlaps=F, summits=0)

ADD REPLY • link 3.1 years ago yuhuihui2011 • 0

0

Entering edit mode

I think the key in your case is using summits=0. I would avoid setting bUseSummarizeOverlaps=F with paired end data.

ADD REPLY • link 3.1 years ago Rory Stark ★ 5.2k

0

Entering edit mode

I have paired-end reads, but only unique mapped reads with only single-end reads in bam files. There was a problem for me to set bUseSummarizeOverlaps=T. There were no errors when I changed bUseSummarizeOverlaps=T.

ADD REPLY • link 3.1 years ago yuhuihui2011 • 0

0

Entering edit mode

Can you tell me what version you are using? There was a bug a while back that resulted in this error message but that has been addressed.

For ATAC data, you may want to try a smaller value for summits, eg. summits=50.

It may be worth trying this with bParallel=FALSE to see if there is an issue running in parallel. You may also want to set the number of cores being used by setting tamoxifen$config$cores before to call to dba.count(). If you machine has a lot of cores available, but limited memory, you can run into trouble (for example if you end up using 50+ cores with not enough memory for each core).

ADD REPLY • link 3.1 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Dear Rory,

Thank you for providing this information. I'll make changes to the script based on your suggestions. Please take a look at the following information about my environment.

> sessionInfo(package = NULL)
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /share/app/R/4.1.1/lib64/R/lib/libRblas.so
LAPACK: /share/app/R/4.1.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] DiffBind_3.4.11             SummarizedExperiment_1.24.0
 [3] Biobase_2.54.0              MatrixGenerics_1.6.0
 [5] matrixStats_0.62.0          GenomicRanges_1.46.1
 [7] GenomeInfoDb_1.30.1         IRanges_2.28.0
 [9] S4Vectors_0.32.4            BiocGenerics_0.40.0

loaded via a namespace (and not attached):
 [1] gtools_3.9.3             mixsqp_0.3-43            latticeExtra_0.6-30
 [4] amap_0.8-18              BSgenome_1.62.0          GenomeInfoDbData_1.2.7
 [7] Rsamtools_2.10.0         yaml_2.3.6               ggrepel_0.9.1
[10] numDeriv_2016.8-1.1      pillar_1.8.1             lattice_0.20-44
[13] glue_1.6.2               limma_3.50.3             bbmle_1.0.25
[16] digest_0.6.30            RColorBrewer_1.1-3       XVector_0.34.0
[19] colorspace_2.0-3         htmltools_0.5.3          Matrix_1.3-4
[22] plyr_1.8.7               XML_3.99-0.11            pkgconfig_2.0.3
[25] invgamma_1.1             ShortRead_1.52.0         emdbook_1.3.12
[28] zlibbioc_1.40.0          mvtnorm_1.1-3            scales_1.2.1
[31] apeglm_1.16.0            jpeg_0.1-9               BiocParallel_1.28.3
[34] tibble_3.1.8             generics_0.1.3           ggplot2_3.3.6
[37] ashr_2.2-54              cli_3.4.1                magrittr_2.0.3
[40] crayon_1.5.2             deldir_1.0-6             systemPipeR_2.0.8
[43] fansi_1.0.3              MASS_7.3-54              gplots_3.1.3
[46] truncnorm_1.0-8          hwriter_1.3.2.1          tools_4.1.1
[49] BiocIO_1.4.0             lifecycle_1.0.3          stringr_1.4.1
[52] interp_1.1-3             munsell_0.5.0            locfit_1.5-9.6
[55] irlba_2.3.5.1            DelayedArray_0.20.0      Biostrings_2.62.0
[58] compiler_4.1.1           GreyListChIP_1.26.0      caTools_1.18.2
[61] rlang_1.0.6              grid_4.1.1               RCurl_1.98-1.9
[64] rjson_0.2.21             htmlwidgets_1.5.4        bitops_1.0-7
[67] restfulr_0.0.15          gtable_0.3.1             R6_2.5.1
[70] GenomicAlignments_1.30.0 rtracklayer_1.54.0       dplyr_1.0.10
[73] bdsmatrix_1.3-6          fastmap_1.1.0            utf8_1.2.2
[76] KernSmooth_2.23-20       stringi_1.7.8            SQUAREM_2021.1
[79] parallel_4.1.1           Rcpp_1.0.9               vctrs_0.4.2
[82] png_0.1-7                tidyselect_1.2.0         coda_0.19-4

ADD REPLY • link 3.1 years ago littlefishes20 ▴ 10

0

Entering edit mode

Yes, the fixes were introduced from DiffBind_3.6.3 in Bioconductor_3.15 (which uses R_4.2.0) so if you could update to a more recent version you may have a better experience.

ADD REPLY • link 3.1 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Hi Rory, thanks for your suggestion. I have updated the DiffBind to 3.6.5, which was built in 4.2.1, and tried some steps. However, none of them successed. Please find the following code for more information.

#multiple core
> tamoxifen <- dba.count(tamoxifen,bParallel=TRUE)
Computing summits...
Re-centering peaks...
Reads will be counted as Paired-end.
Reads will be counted as Paired-end.
failed to open the port 11152, trying a new port... # I have no idea about the port information
Error: Error processing one or more read files. Check warnings().
In addition: Warning messages:
1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE, mc.allow.recursive = TRUE) :
  all scheduled cores encountered errors in user code
2:   error in evaluating the argument 'x' in selecting a method for function 'assay': wrong args for environment subassignment
3:   error in evaluating the argument 'x' in selecting a method for function 'assay': wrong args for environment subassignment

#single core, same as previous reported
> tamoxifen <- dba.count(tamoxifen,bParallel=FALSE)
Computing summits...
Sample: ./shifted.bam125
Sample: ./shifted.bam125
Re-centering peaks...
Sample: ./shifted.bam125
Reads will be counted as Paired-end.
'BiocParallel' did not register default BiocParallelParam:
  missing value where TRUE/FALSE needed
'BiocParallel' did not register default BiocParallelParam:
  missing value where TRUE/FALSE needed
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'assay': error in evaluating the argument 'BPPARAM' in selecting a method for function 'bplapply': attempt to select less than one element in get1index

#single core, with summit set to 50
> tamoxifen <- dba.count(tamoxifen,bParallel=FALSE, summits=50)
Computing summits...
Sample: ./shifted.bam125
Sample: ./shifted.bam125
Re-centering peaks...
Sample: ./shifted.bam125
Reads will be counted as Paired-end.
'BiocParallel' did not register default BiocParallelParam:
  missing value where TRUE/FALSE needed
'BiocParallel' did not register default BiocParallelParam:
  missing value where TRUE/FALSE needed
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'assay': error in evaluating the argument 'BPPARAM' in selecting a method for function 'bplapply': attempt to select less than one element in get1index

Do you have any idea about this problem? Many thanks in advance!

ADD REPLY • link 3.1 years ago littlefishes20 ▴ 10

0

Entering edit mode

Hi Rory,

I am wondering if my data is not suitable for running the complete analysis due to the data size or something else, would it be reasonable to generate a consensus peak set with DiffBind and calculate the read number in each peak with other tools such as featureCounts, and then continue with DiffBind (if DiffBind accepts such matrix as input) or go directly to DEseq2 or edgeR.

ADD REPLY • link 3.1 years ago littlefishes20 ▴ 10

1

Entering edit mode

Have you tried setting summits=0 or summits=25?

I'm perplexed by the errors which should have changed in the newer version, and I have no idea what port message was about. If you would be willing to make some of your data available to me (even two of the bam files would suffice along with the consensus peakset) I could get to the bottom of it. The number or reads or peaks shouldn't make a difference as DiffBind has been used many times on quite large datasets.

The alternatives you propose should work. You can export the consensus peakset using dba.peakset() bRetrieve=TRUE and or specifying a value for the writeFile parameter.

You can read the counts back in to DiffBind either by writing the counts for each sample out to a separate file and including them in your samplesheet (in a column labelled Counts) or by looping through the columns of a count matrix and calling dba.peakset() and including the counts= parameter.

There is some code to do this in a recent post: DiffBind and peak counts as input

If you don't read it back into DiffBind, there are a number of issues you'll have to deal with (such as normalization), as well as losing the reporting and plotting features, however you can of course do the rest of the processing yourself using the underlying tools (eg. DESeq2) if you are comfortable with each of processing steps and associated statistics.

ADD REPLY • link 3.1 years ago Rory Stark ★ 5.2k

0

Entering edit mode

DiffBind is without a doubt my first choice for analyzing my data. I am impressed by the comprehensive DiffBind tutorial, which takes into account numerous factors that we may overlook while processing ourselves.

In terms of summits setting, it works when I set summits=0! Could you please explain what difference the summit setting could make?

I greatly appreciate your assistance, thanks!

ADD REPLY • link 3.1 years ago littlefishes20 ▴ 10

score 1 · Answer 1 · 2022-10-27

In a Reply, littlefishes20 asks:

In terms of summits setting, it works when I set summits=0! Could you please explain what difference the summit setting could make?

I'm breaking this out into an Answer as it really is a distinct question.

The summits parameter in dba.count() alters the the consensus peakset in multiple, potentially impactful, ways.

In the default case where a consensus peakset has been formed by merging peak calls from many sample replicates, the resulting intervals may get wider and wider. Overlapping peaks are merged by increasing their width to cover all of the overlapping regions, so the more samples in the experiment, the wider the merged peaks can get. This can be problematic in that wider peaks are more likely to include "background" regions. In addition, the distribution of peak widths across the consensus peakset may have some outliers (especially very wide intervals) that can have an impact on the underlying assumptions of the statistical methods doing the modelling.

In general, DiffBind emphasizes confidently identifying regions with different read count distributions, rather than identifying the exact boundaries of those regions. You should be able to rely on the intervals identified as being differentially bound with low FDR, not necessarily that those regions perfectly encompass the precise boundaries of differential enrichment. To do this, we attempt to minimize the inclusion of background regions and focus on intervals where there is most likely to be true enrichment in at least one sample group.

The idea behind summits is then to identify the maximal point of enrichment across the samples in which the peaks were identified, and specify a new peak interval centered on this point. The widths of these intervals are drawn consistently using the value of summits (including summits base-pairs up- and down- stream of the summit, as well as the summit itself). This address both the issue of excess background (so long as the value of summits is appropriate to the assay, specifically not too large) as well as regularizing the distribution of peak widths.

Also worth noting is that if any of the re-centered peaks overlap, they will be merged and again re-centered, resulting in only one peak where there may have been more in the consensus peakset. So there may be cases where the original consensus peakset contains narrow peaks, but the summits parameter is larger, and after re-centering there are overlapping peaks that get merged. There are also edge cases where a summit is calculated to be near the edge of a peak interval and gets merged with an adjacent, but not previously overlapping, peak region after being re-centered.

When summits=FALSE, this re-centering is not performed at all and the (merged) consensus peakset is used as is. Re-centering is also not performed when summits=0, however in this case the summits are computed such that they may be examined, and subsequent calls to dba.count() can be made to see how different values of summits work without having to recompute all of the summits each time.

All of this complexity comes from the fact that in most of these enrichment assays, we do not have a pre-defined set of genomic intervals on which to focus. Windowing approaches that skip peak calling completely, like csaw, avoid some of these issues