Questions about DiffBind loaded peaks and re-count reads
Tongjun

Dear Developers,

Thank you for the nice software! I am using DiffBind to merge peaks and re-count the reads in the consensus regions, and I have a few questions.

  1. What does the last column in the peaks mean after loading them with dba() (with PeakCaller set to 'bed')? It seems the first three columns and the fifth column were extracted, but the value in the fifth column differs from the original file. For example:

nova = dba(sampleSheet=samples)
head(nova$peaks[[1]])

  V1     V2     V3         V5
1 chr1 835381 835586 0.09638554
2 chr1 842890 843095 0.10843373
3 chr1 853018 853247 0.15060241
4 chr1 866363 866568 0.08433735
5 chr1 882661 882887 0.21084337
6 chr1 894995 895521 0.09638554

colnames(samples)

[1] "SampleID"   "Tissue"     "bamReads"   "ControlID"  "bamControl"
[6] "Peaks"      "PeakCaller"

In the original file:
   V1     V2     V3                       V4 V5 V6      V7      V8      V9
1 chr1 835381 835586 10_10589255_q0.01_peak_1 16  . 2.97909 3.23418 1.62779
2 chr1 842890 843095 10_10589255_q0.01_peak_2 18  . 3.13408 3.50458 1.83233
3 chr1 853018 853247 10_10589255_q0.01_peak_3 25  . 3.36780 4.39197 2.51693
4 chr1 866363 866568 10_10589255_q0.01_peak_4 14  . 2.61507 2.97812 1.44712
5 chr1 882661 882887 10_10589255_q0.01_peak_5 35  . 4.17826 5.78446 3.59553
6 chr1 894995 895521 10_10589255_q0.01_peak_6 16  . 2.86883 3.23247 1.63777

Can you please explain what the value in the DBA peaks means (such as 0.09638554)? How is it calculated?

  2. I have more than 1,000 peak files/bam files. When I re-count the reads in the consensus regions, I keep running into errors. Can I re-count the same regions one sample at a time? If so, please guide me on how to do it.
nova2 <- dba.count(nova, minOverlap=50, summits=FALSE, bScaleControl=TRUE, bParallel=4)

After running for quite a while, the following errors appear:
[E::bgzf_read] Read block operation failed with error -1 after 0 of 4 bytes
[E::bgzf_read] Read block operation failed with error -1 after 0 of 4 bytes
[E::bgzf_read] Read block operation failed with error -1 after 0 of 4 bytes
[E::bgzf_read] Read block operation failed with error -1 after 0 of 4 bytes
[E::bgzf_read] Read block operation failed with error -1 after 0 of 4 bytes
[E::bgzf_read] Read block operation failed with error -1 after 0 of 4 bytes
[E::bgzf_read] Read block operation failed with error -1 after 0 of 4 bytes
[E::bgzf_read] Read block operation failed with error -1 after 0 of 4 bytes
...
[W::bam_hdr_read] bgzf_check_EOF: Cannot allocate memory
[E::bam_hdr_read] Invalid BAM binary header
[E::bgzf_read] Read block operation failed with error -1 after 0 of 4 bytes
[E::bam_hdr_read] Error reading BGZF stream

 *** caught segfault ***
address 0x10, cause 'memory not mapped'

Traceback:
 1: cpp_count_reads(bamfile, insertLength, fileType, bufferSize,     intervals, bWithoutDupes, summits, minMappingQuality)
 2: pv.getCounts(bamfile = countrec$bamfile, intervals = intervals,     insertLength = countrec$insert, bWithoutDupes = bWithoutDupes,     bLowMem = bLowMem, yieldSize = yieldSize, mode = mode, singleEnd = singleEnd,     scanbamparam = scanbamparam, fileType = fileType, summits = summits,     fragments = fragments, minMappingQuality = minMappingQuality)
 3: FUN(X[[i]], ...)
 4: lapply(X = S, FUN = FUN, ...)
 5: doTryCatch(return(expr), name, parentenv, handler)
 6: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 7: tryCatchList(expr, classes, parentenv, handlers)
 8: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        sm <- strsplit(conditionMessage(e), "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && isTRUE(getOption("show.error.messages"))) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
 9: try(lapply(X = S, FUN = FUN, ...), silent = TRUE)
10: sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE))
11: FUN(X[[i]], ...)
12: lapply(seq_len(cores), inner.do)
13: mclapply(arglist, fn, ..., mc.preschedule = TRUE, mc.allow.recursive = TRUE)
14: config$lapplyFun(config, params, arglist, fn, ...)
15: dba.parallel.lapply(pv$config, params, todorecs, pv.do_getCounts,     bed, bWithoutDupes = bWithoutDupes, bLowMem, yieldSize, mode,     singleEnd, scanbamparam, readFormat, summits, fragments,     minMappingQuality)
16: pv.counts(DBA, peaks = peaks, minOverlap = minOverlap, defaultScore = score,     bLog = bLog, insertLength = fragmentSize, bOnlyCounts = T,     bCalledMasks = TRUE, minMaxval = filter, bParallel = bParallel,     bUseLast = bUseLast, bWithoutDupes = bRemoveDuplicates, bScaleControl = bScaleControl,     filterFun = filterFun, bLowMem = bUseSummarizeOverlaps, readFormat = readFormat,     summits = summits, minMappingQuality = mapQCth)
17: dba.count(nova, minOverlap = 50, summits = FALSE, bScaleControl = TRUE,     score = DBA_SCORE_RPKM, bParallel = 4)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 
 *** caught segfault ***
address 0x10, cause 'memory not mapped'

  3. For broad peaks, if I use the summits parameter in dba.count(), will a summit be generated for each merged peak per sample? Can the summit be generated according to the distribution of the reads across all samples?

Thank you very much!

Tongjun

Rory Stark

[1]. The peak score (prior to counting reads) is the score from the ScoreCol (fifth column for bed), normalized to a 0..1 interval by dividing each score by the maximum score. Note that you can retrieve a GRanges object with the scores for each peakset by calling dba.peakset() with peaks=n (where n is the sample number) and bRetrieve=TRUE.
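
For example, a quick check (a sketch; the file name is illustrative, and it assumes the original MACS-style bed file has been read into orig.peaks):

orig.peaks <- read.table("sample1_peaks.bed")
# DiffBind's loaded score is the fifth column divided by its maximum;
# with the values shown above the maximum appears to be 166, so 16/166 = 0.09638554
head(orig.peaks$V5 / max(orig.peaks$V5))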

[2]. I'm guessing there is a memory issue when reading. It may not scale to 1,000 samples, depending on how large the consensus peakset is. You can check how many intervals there are with the following line of code:

dba.overlap(nova, mode=DBA_OLAP_RATE)[50]

You can also try running dba.count() with bParallel=FALSE. It will take longer but use less memory, and it will give you status messages telling you exactly where it is failing.
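
For example, re-running the call above serially (same parameters as before, only with parallel processing disabled):

nova2 <- dba.count(nova, minOverlap=50, summits=FALSE,
                   bScaleControl=TRUE, bParallel=FALSE)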

[3]. Summits are calculated for each (merged) consensus peak, but not per sample (only one summit is computed per interval). The pileups at each nt are counted for each sample for which a called peak overlaps the merged peak, then the overall summit (the point of highest pileup) is used. Note that if you are having memory problems with summits=FALSE, they may be worse if you turn summits on, as this step takes more memory.

Tongjun

Thank you very much for your answer!

After turning off bParallel, dba.count() works nicely. However, there is no dba.normalize method in DiffBind_2.14.0. The error I get when running dba.normalize is below:

> library(DiffBind)
> dba.normalize
Error: object 'dba.normalize' not found

> dba.
dba.analyze      dba.count        dba.mask         dba.peakset      dba.plotHeatmap  dba.plotPCA      dba.plotVolcano  dba.save
dba.contrast     dba.load         dba.overlap      dba.plotBox      dba.plotMA       dba.plotVenn     dba.report       dba.show

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] DiffBind_2.14.0             SummarizedExperiment_1.16.1
 [3] DelayedArray_0.12.3         BiocParallel_1.20.1        
 [5] matrixStats_0.57.0          Biobase_2.46.0             
 [7] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
 [9] IRanges_2.20.2              S4Vectors_0.24.4           
[11] BiocGenerics_0.32.0        

loaded via a namespace (and not attached):
  [1] Category_2.52.1          bitops_1.0-6             bit64_4.0.5             
  [4] RColorBrewer_1.1-2       progress_1.2.2           httr_1.4.2              
  [7] Rgraphviz_2.30.0         backports_1.2.0          tools_3.6.3             
 [10] R6_2.5.0                 KernSmooth_2.23-18       DBI_1.1.0               
 [13] colorspace_2.0-0         withr_2.3.0              tidyselect_1.1.0        
 [16] prettyunits_1.1.1        bit_4.0.4                curl_4.3                
 [19] compiler_3.6.3           graph_1.64.0             rtracklayer_1.46.0      
 [22] checkmate_2.0.0          caTools_1.18.0           scales_1.1.1            
 [25] genefilter_1.68.0        RBGL_1.62.1              askpass_1.1             
 [28] rappdirs_0.3.1           stringr_1.4.0            digest_0.6.27           
 [31] Rsamtools_2.2.3          AnnotationForge_1.28.0   XVector_0.26.0          
 [34] jpeg_0.1-8.1             pkgconfig_2.0.3          BSgenome_1.54.0         
 [37] dbplyr_2.0.0             limma_3.42.2             rlang_0.4.8             
 [40] RSQLite_2.2.1            GOstats_2.52.0           generics_0.1.0          
 [43] hwriter_1.3.2            gtools_3.8.2             dplyr_1.0.2             
 [46] VariantAnnotation_1.32.0 RCurl_1.98-1.2           magrittr_1.5            
 [49] GO.db_3.10.0             GenomeInfoDbData_1.2.2   Matrix_1.2-18           
 [52] Rcpp_1.0.5               munsell_0.5.0            lifecycle_0.2.0         
 [55] yaml_2.2.1               stringi_1.5.3            edgeR_3.28.1            
 [58] debugme_1.1.0            zlibbioc_1.32.0          gplots_3.1.0            
 [61] BiocFileCache_1.10.2     grid_3.6.3               blob_1.2.1              
 [64] ggrepel_0.8.2            crayon_1.3.4             lattice_0.20-41         
 [67] splines_3.6.3            Biostrings_2.54.0        GenomicFeatures_1.38.2  
 [70] annotate_1.64.0          hms_0.5.3                batchtools_0.9.14       
 [73] locfit_1.5-9.4           pillar_1.4.6             rjson_0.2.20            
 [76] systemPipeR_1.20.0       base64url_1.4            biomaRt_2.42.1          
 [79] XML_3.99-0.3             glue_1.4.2               ShortRead_1.44.3        
 [82] latticeExtra_0.6-29      data.table_1.13.2        vctrs_0.3.5             
 [85] png_0.1-7                gtable_0.3.0             openssl_1.4.3           
 [88] purrr_0.3.4              amap_0.8-18              assertthat_0.2.1        
 [91] ggplot2_3.3.2            xtable_1.8-4             survival_3.2-7          
 [94] pheatmap_1.0.12          tibble_3.0.4             GenomicAlignments_1.22.1
 [97] AnnotationDbi_1.48.0     memoise_1.1.0            brew_1.0-6              
[100] ellipsis_0.3.1           GSEABase_1.48.0         
>

If I run dba.count() with score=DBA_SCORE_RPKM, will the peaks be returned with scores normalized to the total aligned reads from the ChIP sample? From the tests shown in the DiffBind manual, DBA_LIBSIZE_FULL seems to be a better approach, but I am not sure how to apply it using the aligned reads from the ChIP samples. Thank you.

Rory Stark

You are running older versions of R and Bioconductor. dba.normalize() was introduced in Bioconductor 3.12 (DiffBind_3.0.7).

To use the latest version, you need to update to R version 4.0.3 and then install the current Bioconductor version using BiocManager::install(version = "3.12").
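
For example (a minimal sketch, assuming R 4.0.x is already installed):

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install(version="3.12")   # installs/updates Bioconductor 3.12
BiocManager::install("DiffBind")       # installs DiffBind 3.0.x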

Tongjun

Thank you very much, Dr. Stark! We have R/4.0.2 available on the server; I updated everything to Bioconductor 3.12 and dba.normalize is now available. I still have a few other questions:

  1. I used dba.save to save my DBA object for later use, but after loading it I could not run dba.peakset to get the consensus peaks. I googled the error and couldn't find a solution. Here are the commands and errors:
> load("dba.novogene.dba.RData")
> novo = DBAobject
> names(novo)
 [1] "peaks"       "class"       "chrmap"      "config"      "samples"    
 [6] "called"      "totalMerged" "merged"      "attributes"  "minOverlap" 
[11] "masks"

> consensus.peaks <- dba.peakset(novo, bRetrieve=TRUE)
error:
Error in rep(T, nrow(pv$binding)) : invalid 'times' argument

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] DiffBind_3.0.7              SummarizedExperiment_1.20.0
 [3] Biobase_2.50.0              MatrixGenerics_1.2.0       
 [5] matrixStats_0.57.0          GenomicRanges_1.42.0       
 [7] GenomeInfoDb_1.26.1         IRanges_2.24.0             
 [9] S4Vectors_0.28.0            BiocGenerics_0.36.0        

loaded via a namespace (and not attached):
  [1] amap_0.8-18              colorspace_2.0-0         rjson_0.2.20            
  [4] hwriter_1.3.2            ellipsis_0.3.1           XVector_0.30.0          
  [7] ggrepel_0.8.2            bit64_4.0.5              mvtnorm_1.1-1           
 [10] apeglm_1.12.0            AnnotationDbi_1.52.0     xml2_1.3.2              
 [13] splines_4.0.2            jsonlite_1.7.1           Rsamtools_2.6.0         
 [16] annotate_1.68.0          ashr_2.2-47              GO.db_3.12.1            
 [19] dbplyr_2.0.0             png_0.1-7                GreyListChIP_1.22.0     
 [22] pheatmap_1.0.12          graph_1.68.0             compiler_4.0.2          
 [25] httr_1.4.2               GOstats_2.56.0           backports_1.2.0         
 [28] assertthat_0.2.1         Matrix_1.2-18            limma_3.46.0            
 [31] prettyunits_1.1.1        tools_4.0.2              coda_0.19-4             
 [34] gtable_0.3.0             glue_1.4.2               GenomeInfoDbData_1.2.4  
 [37] Category_2.56.0          systemPipeR_1.24.2       dplyr_1.0.2             
 [40] rsvg_2.1                 batchtools_0.9.14        rappdirs_0.3.1          
 [43] V8_3.4.0                 ShortRead_1.48.0         Rcpp_1.0.5              
 [46] bbmle_1.0.23.1           vctrs_0.3.5              Biostrings_2.58.0       
 [49] debugme_1.1.0            rtracklayer_1.50.0       stringr_1.4.0           
 [52] irlba_2.3.3              lifecycle_0.2.0          gtools_3.8.2            
 [55] XML_3.99-0.5             edgeR_3.32.0             MASS_7.3-53             
 [58] zlibbioc_1.36.0          scales_1.1.1             BSgenome_1.58.0         
 [61] VariantAnnotation_1.36.0 hms_0.5.3                RBGL_1.66.0             
 [64] RColorBrewer_1.1-2       yaml_2.2.1               curl_4.3                
 [67] memoise_1.1.0            ggplot2_3.3.2            emdbook_1.3.12          
 [70] bdsmatrix_1.3-4          biomaRt_2.46.0           SQUAREM_2020.5          
 [73] latticeExtra_0.6-29      stringi_1.5.3            RSQLite_2.2.1           
 [76] genefilter_1.72.0        checkmate_2.0.0          GenomicFeatures_1.42.1  
 [79] caTools_1.18.0           BiocParallel_1.24.1      DOT_0.1                 
 [82] truncnorm_1.0-8          rlang_0.4.9              pkgconfig_2.0.3         
 [85] bitops_1.0-6             invgamma_1.1             lattice_0.20-41         
 [88] purrr_0.3.4              GenomicAlignments_1.26.0 bit_4.0.4               
 [91] tidyselect_1.1.0         GSEABase_1.52.0          AnnotationForge_1.32.0  
 [94] plyr_1.8.6               magrittr_2.0.1           R6_2.5.0                
 [97] gplots_3.1.1             generics_0.1.0           base64url_1.4           
[100] DelayedArray_0.16.0      DBI_1.1.0                pillar_1.4.7            
[103] withr_2.3.0              mixsqp_0.3-43            survival_3.2-7          
[106] RCurl_1.98-1.2           tibble_3.0.4             crayon_1.3.4            
[109] KernSmooth_2.23-18       BiocFileCache_1.14.0     jpeg_0.1-8.1            
[112] progress_1.2.2           locfit_1.5-9.4           grid_4.0.2              
[115] data.table_1.13.2        blob_1.2.1               Rgraphviz_2.34.0        
[118] digest_0.6.27            xtable_1.8-4             numDeriv_2016.8-1.1     
[121] brew_1.0-6               openssl_1.4.3            munsell_0.5.0           
[124] askpass_1.1

I used the same version of R and DiffBind for saving and loading.

  2. How do I obtain a subset of the consensus peaks, such as 0.1% of the merged peaks?

I could convert consensus.peaks to a data.frame and extract a subset, but I am not sure how to convert it back.

  3. What is the meaning of the output from dba.count()?

Here are the commands and the first few rows of the output:

> novo2 <- dba.count(novo, peaks=consensus.peaks, fragmentSize=0, summits=FALSE, bSubControl=FALSE, score=DBA_SCORE_RPKM, bParallel=FALSE)
> head(novo$peaks[[1]])
 Chr  Start    End     Score      RPKM Reads      cRPKM cReads
1 chr1 818625 819192 1.9136482 1.9136482    15 0.20383173      2
2 chr1 828966 829355 1.6722341 1.6722341     9 0.70790318      4
3 chr1 832805 833037 0.3110021 0.3110021     1 0.42044921      2
4 chr1 833800 834440 1.0174279 1.0174279     9 0.09725609      1
5 chr1 855020 856486 1.0373095 1.0373095    21 0.10320377      3
6 chr1 860637 860909 1.5926039 1.5926039     6 0.22835586      1
> info = dba.show(novo)
> info
             ID Tissue Caller Intervals    Reads FRiP
1   10_10606903  Brain counts    210330 13800055 0.60
2 101_11615242b  Brain counts    210330 10101930 0.43
3 103_43485807b  Brain counts    210330 11268599 0.40

In novo$peaks[[1]]:

a. Is the RPKM calculated using the Reads in info (the total aligned paired-end reads) and the length of the peak?

b. Is Reads the raw read count in the peak?

c. What are cRPKM and cReads?

d. Can the program recognize whether the bam files are single-end or paired-end? Most of our samples are paired-end and a portion are single-end.

  4. When I run dba.normalize following the above dba.count, no results come out. I tried the following commands:
> norm <- dba.normalize(novo, bRetrieve=TRUE, method=DBA_EDGER, normalize=DBA_NORM_TMM, library=DBA_LIBSIZE_FULL, background=FALSE)
> norm
NULL
> norm <- dba.normalize(novo, bRetrieve=TRUE)
> norm
NULL
  5. When using summits in dba.count(), can the summit region be calculated by a percentage rather than a fixed width? For example, build a distribution of the read coverage, use 95% of the distribution as a threshold, and keep the region covered above that threshold; or simply extract the region covering 95% of the reads around the summit.

  6. If I turn on bParallel=TRUE in dba.count(), can I also specify the number of threads and the memory per thread?

I appreciate your time!

Tongjun

Rory Stark

I used dba.save to save my DBA object for later use

When you save your object using dba.save(), you need to use dba.load() to restore it:

novo <- dba.load("novogene.dba")
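
A minimal save/load pairing might look like this (the file name here is illustrative; by default dba.save() prefixes it with "dba_" and adds an ".RData" extension):

dba.save(novo, file="novogene")     # writes ./dba_novogene.RData by default
novo <- dba.load(file="novogene")   # restores the DBA object from that file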

How do I obtain a subset of the consensus peaks, such as 0.1% of the merged peaks? I could convert consensus.peaks to a data.frame and extract a subset, but I am not sure how to convert it back.

novo <- dba(novo, minOverlap=0.99)
consensus.peaks <- dba.peakset(novo, bRetrieve=TRUE)
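
Since consensus.peaks is a GRanges object, it can also be subset directly without converting to a data.frame; for example, a random 0.1% sample (a sketch using standard GenomicRanges operations, not a DiffBind function):

set.seed(1)                                         # for reproducibility
n.keep <- ceiling(0.001 * length(consensus.peaks))  # 0.1% of the intervals
peak.subset <- consensus.peaks[sample(length(consensus.peaks), n.keep)]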

the meaning of the output from dba.count

You are looking at undocumented DiffBind internals. The documented way to get this information is to set score using the score parameter in dba.count(). (You can reset the score without re-counting by setting peaks=NULL).
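
For example, to switch an already-counted object to RPKM scores without re-counting the reads:

novo <- dba.count(novo, peaks=NULL, score=DBA_SCORE_RPKM)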

Is the RPKM calculated using the Reads in info (the total aligned paired-end reads) and length of the peak?

Yes, this is a function of the total number of reads in the library, the number of overlapping reads, and the length of the peak.
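
As a quick sanity check against the first row shown above (assuming the interval width is End - Start + 1):

reads   <- 15                        # Reads column, row 1
width   <- 819192 - 818625 + 1       # 568 bp
libsize <- 13800055                  # total aligned reads for this sample
reads / ((width / 1000) * (libsize / 1e6))   # ~1.9136, matching the RPKM column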

Is Reads the raw read count in the peak?

Yes.

What are cRPKM and cReads?

cReads is the total number of Control reads (from the Control track) overlapping the peak; cRPKM is the RPKM calculated from these.

Can the program recognize whether the bam files are single-end or paired-end? Most of our samples are paired-end and a portion are single-end.

DiffBind automagically recognizes whether the bam files are paired-end or single-end and counts them appropriately (assuming the default bUseSummarizeOverlaps=TRUE). However, currently there is a limitation in that all the bam files are counted the same way, so it is not easy to mix paired-end and single-end data in the same project. This is on the list of features to add. The main way to get around this is to treat all the data as single-end, so that for the paired-end data only the first read is counted. This can be accomplished by setting bUseSummarizeOverlaps=FALSE.

when I run dba.normalize following the above dba.count, no results come out

You should normalize first, then separately retrieve the normalization record:

novo <- dba.normalize(novo, method=DBA_EDGER, normalize=DBA_NORM_TMM, library=DBA_LIBSIZE_FULL, background=FALSE)
norm <- dba.normalize(novo, bRetrieve=TRUE)

When using summits in dba.count(), can the summit region be calculated by a percentage rather than a fixed width? For example, build a distribution of the read coverage, use 95% of the distribution as a threshold, and keep the region covered above that threshold; or simply extract the region covering 95% of the reads around the summit.

This is not supported; the idea is to have consistent peak widths. However, it would be possible to extract the computed summits, calculate your own consensus peakset with different peak widths, and supply that to dba.count().

If I turn on bParallel=TRUE in dba.count, can I also specify the number of threads and memory per thread?

Parallel mode is turned on by default. You can control parallel mode by setting:

novo$config$RunParallel <- TRUE #FALSE to revert to serial mode

You can control the number of cores by setting:

novo$config$cores <- 8 #default is BiocParallel::multicoreWorkers()

You cannot control how much memory is allocated to each core.

Tongjun

Thank you! It helps so much!

Tongjun

Hi Dr. Stark, I have questions about the read numbers counted by the DiffBind function dba.count(). I used DiffBind_3.0.15 under R/4.0.5. Here are the commands I used:

fsize = c(rep(151,100), rep(150, 100), rep(51,100))

novo <- dba.count(novo, peaks=consensus.peaks, fragmentSize=fsize, summits=FALSE, bSubControl=FALSE, score=DBA_SCORE_RPKM, bParallel=FALSE, bUseSummarizeOverlaps=FALSE)
info = dba.show(novo)
info$Reads

My input is a mix of paired-end (read length 150 or 151) and single-end (read length 51) samples. Are the reads counted by dba.count() doubled for the paired-end samples (because both mates are counted)? You explained earlier that by setting bUseSummarizeOverlaps=FALSE, the paired-end samples can be treated as single-end, but in the output of info$Reads, the read counts are much higher than they should be for the paired-end samples and also a little higher for the single-end samples.

What is the default mapping quality used for filtering the reads?

Will a mix of paired-end and single-end samples affect the read counting?

Thank you very much!

Rory Stark

DiffBind_3.2 and subsequent releases include support for mixing paired-end and single-end libraries; however, the default bUseSummarizeOverlaps=TRUE should be used. When this is FALSE, all reads are counted, so paired-end reads are indeed counted twice and single-end reads only once, and mixing them will indeed cause problems. From DiffBind_3.2 onwards, the software should automatically detect the paired-end or single-end status of each bam file.

The default mapping quality threshold used for filtering is 15; it can be changed by re-setting the DBA$config$minQCth value or by specifying the mapQCth parameter when calling dba.count().

Tongjun

Thank you for your reply. When I read the manual again, I found that if bUseSummarizeOverlaps=TRUE, the fragmentSize parameter must be absent, so I should remove the fragmentSize parameter. Am I right? Will this affect the counting for the single-end samples?

Should I run dba.count() as follows? Thank you very much.

novo <- dba.count(novo, peaks=consensus.peaks, summits=FALSE, bSubControl=FALSE, score=DBA_SCORE_RPKM, bParallel=FALSE, bUseSummarizeOverlaps=TRUE, mapQCth=20)