Search
Question: DiffBind remove low count intervals for differential analysis
0
11 months ago by
gasta880
gasta880 wrote:

Hello,

I'm carrying on part of a ChIP-Seq differential analysis for the first time.

I have 6 replicates and 3 conditions; in each condition the first 3 replicates have been run separately from the other 3.

I have a couple of questions:

1) In the analysis, the consensus profile is calculated like so:

peakset <- dba(sampleSheet="/homes/fgastaldello/sumodiff/sampleSheet.csv")

peakset <- dba.peakset(peakset, consensus=DBA_CONDITION, minOverlap=3)

Is it correct to assume that peaks are pooled together by condition and only those that satisfy the minOverlap parameter are included in the profile?

2) After generating the counts using the standard score for the dba.count function, a mask is created to separate conditions and the first 3 replicates from the other 3. A contrast is generated and the differential analysis is performed like so:

batch.mask <- dba.mask(peakset,DBA_REPLICATE, c(1,2,3))

peaks.de <- dba.analyze(peakset,bParallel = FALSE,filter=?,filterFun=??)

I would like to remove those intervals/peaks that contains a count lower than filter? From what I understood from the documentation and other posts, filterFun is applied to each interval and discard anything that is less than filter.

But are the intervals/peaks separated by blocks? By conditions as specified in the contrast?

filterFun can be customized, but is there any example that I can use as a reference?

Thanks,

Francesco

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

locale:
[1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8
[6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] DBChIP_1.18.0              data.table_1.10.4          cowplot_0.9.1              ggplot2_2.2.1              kableExtra_0.6.0
[6] DESeq_1.26.0               lattice_0.20-34            locfit_1.5-9.1             edgeR_3.16.5               limma_3.30.13
[11] knitr_1.15.1               BiocInstaller_1.24.0       plyr_1.8.4                 GenomicFeatures_1.26.3     AnnotationDbi_1.36.2
[16] biomaRt_2.30.0             Gviz_1.18.2                DiffBind_2.2.9             SummarizedExperiment_1.4.0 Biobase_2.34.0
[21] GenomicRanges_1.26.4       GenomeInfoDb_1.10.3        IRanges_2.8.2              S4Vectors_0.12.2           BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] amap_0.8-14                   colorspace_1.3-2              rjson_0.2.15                  hwriter_1.3.2
[5] rprojroot_1.2                 biovizBase_1.22.0             htmlTable_1.9                 XVector_0.14.1
[9] base64enc_0.1-3               dichromat_2.0-0               interactiveDisplayBase_1.12.0 xml2_1.1.1
[13] splines_3.3.1                 fail_1.3                      geneplotter_1.52.0            Formula_1.2-1
[17] Rsamtools_1.26.1              annotate_1.52.1               cluster_2.0.6                 GO.db_3.4.0
[25] httr_1.2.1                    GOstats_2.40.0                backports_1.0.5               assertthat_0.1
[29] Matrix_1.2-8                  lazyeval_0.2.0                acepack_1.4.1                 htmltools_0.3.5
[33] tools_3.3.1                   gtable_0.2.0                  Category_2.40.0               systemPipeR_1.8.1
[41] gdata_2.17.0                  rtracklayer_1.34.2            stringr_1.2.0                 rvest_0.3.2
[45] mime_0.5                      ensembldb_1.6.2               gtools_3.5.0                  XML_3.98-1.5
[49] AnnotationHub_2.6.5           zlibbioc_1.20.0               scales_0.4.1                  BSgenome_1.42.0
[53] VariantAnnotation_1.20.3      hms_0.3                       RBGL_1.50.0                   RColorBrewer_1.1-2
[57] BBmisc_1.11                   yaml_2.1.14                   memoise_1.0.0                 gridExtra_2.2.1
[61] rpart_4.1-10                  latticeExtra_0.6-28           stringi_1.1.2                 RSQLite_1.1-2
[65] genefilter_1.56.0             checkmate_1.8.2               caTools_1.17.1                BiocParallel_1.8.1
[69] BatchJobs_1.6                 matrixStats_0.51.0            bitops_1.0-6                  evaluate_0.10
[73] GenomicAlignments_1.10.1      htmlwidgets_0.8               GSEABase_1.36.0               AnnotationForge_1.16.1
[77] magrittr_1.5                  sendmailR_1.2-1               R6_2.2.0                      gplots_3.0.1
[81] Hmisc_4.0-2                   DBI_0.6                       foreign_0.8-67                survival_2.41-2
[85] RCurl_1.95-4.8                nnet_7.3-12                   tibble_1.2                    KernSmooth_2.23-15
[89] rmarkdown_1.6                 digest_0.6.12                 xtable_1.8-2                  httpuv_1.3.3
[93] brew_1.0-6                    munsell_0.4.3                 viridisLite_0.1.3
modified 11 months ago • written 11 months ago by gasta880
0
11 months ago by
Rory Stark2.6k
CRUK, Cambridge, UK
Rory Stark2.6k wrote:

Let's take this set-by-step as you raise a number of issues.

First is the derivation of the consensus peakset. Your second line of code, calling dba.peakset(), does not determine the consensus peakset used for counting. I’m a little confused as to how you used this result, as passing it to dba.count() should result in an error. What does your call to dba.count() look like?

If you don’t pass a specific set of peaks to dba.count(), it will generate a consensus peakset using the minOverlap parameter across all the samples. Is this what you want? Or do you want to first create separate consensus peaksets for each of the two conditions (consisting of peaks that overlap in at least three of the six replicates), then take the union of these two peaksets as the overall consensus? If it is the latter you want, you can accomplish this with the following code:

> peakset <- dba(sampleSheet="/homes/fgastaldello/sumodiff/sampleSheet.csv")
> consensus.dba <- dba.peakset(peakset, consensus=DBA_CONDITION, minOverlap=3)
> consensus.dba <- dba(consensus.dba,mask=consensus1$masks$Consensus,minOverlap=1)
> consensus.peaks <- dba.peakset(consensus.dba,bRetrieve=TRUE)
> peakset <- dba.count(peakset,peaks=consensus.peaks)

Second is the how you are modelling the fact that the first three replicates were done in a batch separately from the second three replicates. Your batch.mask should be fine for accomplishing this.

Third is the filtering. The key thing to remember here is that there is a single binding count matrix created for each contrast, which includes as rows all the consensus sites and as columns all the samples in the contrast. The blocking factor only impacts the modelling of the data (when the liner models are fit). The filtering occurs when the binding matrix is created: the filterFun will be applied to each row of the binding matrix (each consensus site) across ALL the samples. So in the default case where filterFun=max, any consensus site where there isn’t at least one sample with at least as many reads counts as the filter value will be removed.

If you wanted to apply more complex filtering criteria, you can write your own function filterFun. This function should take a single parameter, a vector of read count values, and return a score.

• NB: there is actually a bug in the current release where using filtering in dba.analyze() may result in a situation where you can not work with the results, e.g. dba.report() will report an error. This is being fixed, but in the meantime, the workaround is to apply the filtering at the dba.count() stage.

Cheers-
Rory

0
11 months ago by
gasta880
gasta880 wrote:

I apology for the misunderstanding in my first question. The approach that you described is exactly what my colleague has done in calculating the consensus and counting peaks. Here is the original script launched on our cluster:

peakset <- dba(sampleSheet="sampleSheet.csv")
peakset.cons <- dba.peakset(peakset, consensus=DBA_CONDITION, minOverlap=3)
peakset.cons <- dba(peakset.cons, mask=peakset.cons$masks$Consensus, minOverlap=1)
peaks.cons <- dba.peakset(peakset.cons, bRetrieve = TRUE)
peakcount <- dba.count(peakset, peaks=peaks.cons, bParallel=TRUE)

The clarification about the role of the DBA_CONDITION flag was what I needed to understand how the consensus peakset was made.

For the filtering question, it is more clear now! I'll apply the filtering stage in dba.count() for safety.

Thanks,

Francesco