Question

DiffBind - Losing replicate information?

0

Entering edit mode

Sophie Shaw • 0

@sophie-shaw-14181

Last seen 6.5 years ago

Aberdeen, UK

Hi DiffBind Team,

I've been looking to use your package to analyse some ChIP seq data and have been following the vignette for guidance.

My data has no replicates and I'm fully aware that this is not a good study design, but it's the data we have to work with. I have been working with two peak callers (MACS and SPP) and was hoping to treat these two sets of peaks as "replicates" in DiffBind. I'm convinced I've read a suggestion for this in the past but can't find the statement anywhere in the DiffBind documentation.

I've set up the sample sheet as follows, using the same Sample ID/Bam file but different peak callers:

SampleID,Tissue,Factor,Treatment,Replicate,bamReads,ControlID,bamControl,Peaks,PeakCaller

WT_HU,WT,Rif1,HU,1,../2016_020B_3_WT_HU_sorted_alignment.bam,WT_HU_1,../2016_020B_1_WT_HU_sorted_alignment.bam,../MACS2/WT_HU_peaks.narrowPeak,macs

MT_HU,MT,Rif1,HU,1,../2016_020B_4_MT_HU_sorted_alignment.bam,MT_HU_2,../2016_020B_2_MT_HU_sorted_alignment.bam,../MACS2/MT_HU_peaks.narrowPeak,macs

WT_G1,WT,Rif1,G1,1,../2016_024B_3_WT_G1_sorted_alignment.bam,WT_G1_1,../2016_024B_1_WT_G1_sorted_alignment.bam,../MACS2/WT_G1_peaks.narrowPeak,macs

MT_G1,MT,Rif1,G1,1,../2016_024B_4_MT_G1_sorted_alignment.bam,MT_G1_2,../2016_024B_2_MT_G1_sorted_alignment.bam,../MACS2/MT_G1_peaks.narrowPeak,macs

WT_S60,WT,Rif1,S60,1,../2017_006B_S1_WT_S60_sorted_alignment.bam,WT_S60_1,../2017_006B_1_WT_S60_sorted_alignment.bam,../MACS2/WT_S60_peaks.narrowPeak,macs

MT_S60,MT,Rif1,S60,1,../2017_006B_S5_MT_S60_sorted_alignment.bam,MT_S60_5,../2017_006B_5_MT_S60_sorted_alignment.bam,../MACS2/MT_S60_peaks.narrowPeak,macs

WT_S90,WT,Rif1,S90,1,../2017_006B_S4_WT_S90_sorted_alignment.bam,WT_S90_4,../2017_006B_4_WT_S90_sorted_alignment.bam,../MACS2/WT_S90_peaks.narrowPeak,macs

MT_S90,MT,Rif1,S90,1,../2017_006B_S8_MT_S90_sorted_alignment.bam,MT_S90_8,../2017_006B_8_MT_S90_sorted_alignment.bam,../MACS2/MT_S90_peaks.narrowPeak,macs

WT_HU,WT,Rif1,HU,2,../2016_020B_3_WT_HU_sorted_alignment.bam,WT_HU_1,../2016_020B_1_WT_HU_sorted_alignment.bam,../SPP_Peaks/2016_020B_1_WT_HU_sorted_alignment_VS_2016_020B_3_WT_HU_sorted_alignment.narrowPeak,narrow

MT_HU,MT,Rif1,HU,2,../2016_020B_4_MT_HU_sorted_alignment.bam,MT_HU_2,../2016_020B_2_MT_HU_sorted_alignment.bam,../SPP_Peaks/2016_020B_2_MT_HU_sorted_alignment_VS_2016_020B_4_MT_HU_sorted_alignment.narrowPeak,narrow

WT_G1,WT,Rif1,G1,2,../2016_024B_3_WT_G1_sorted_alignment.bam,WT_G1_1,../2016_024B_1_WT_G1_sorted_alignment.bam,../SPP_Peaks/2016_024B_1_WT_G1_sorted_alignment_VS_2016_024B_3_WT_G1_sorted_alignment.narrowPeak,narrow

MT_G1,MT,Rif1,G1,2,../2016_024B_4_MT_G1_sorted_alignment.bam,MT_G1_2,../2016_024B_2_MT_G1_sorted_alignment.bam,../SPP_Peaks/2016_024B_2_MT_G1_sorted_alignment_VS_2016_024B_4_MT_G1_sorted_alignment.narrowPeak,narrow

WT_S60,WT,Rif1,S60,2,../2017_006B_S1_WT_S60_sorted_alignment.bam,WT_S60_1,../2017_006B_1_WT_S60_sorted_alignment.bam,../SPP_Peaks/2017_006B_1_WT_S60_sorted_alignment_VS_2017_006B_S1_WT_S60_sorted_alignment.narrowPeak,narrow

MT_S60,MT,Rif1,S60,2,../2017_006B_S5_MT_S60_sorted_alignment.bam,MT_S60_5,../2017_006B_5_MT_S60_sorted_alignment.bam,../SPP_Peaks/2017_006B_5_MT_S60_sorted_alignment_VS_2017_006B_S5_MT_S60_sorted_alignment.narrowPeak,narrow

WT_S90,WT,Rif1,S90,2,../2017_006B_S4_WT_S90_sorted_alignment.bam,WT_S90_4,../2017_006B_4_WT_S90_sorted_alignment.bam,../SPP_Peaks/2017_006B_4_WT_S90_sorted_alignment_VS_2017_006B_S4_WT_S90_sorted_alignment.narrowPeak,narrow

MT_S90,MT,Rif1,S90,2,../2017_006B_S8_MT_S90_sorted_alignment.bam,MT_S90_8,../2017_006B_8_MT_S90_sorted_alignment.bam,../SPP_Peaks/2017_006B_8_MT_S90_sorted_alignment_VS_2017_006B_S8_MT_S90_sorted_alignment.narrowPeak,narrow

Then ran:

data<- dba(sampleSheet = samples)

Giving:

data
16 Samples, 1079 sites in matrix (1896 total):

       ID Tissue Factor Treatment Replicate Caller Intervals
1   WT_HU     WT   Rif1        HU         1   macs      1055
2   MT_HU     MT   Rif1        HU         1   macs      1461
3   WT_G1     WT   Rif1        G1         1   macs       631
4   MT_G1     MT   Rif1        G1         1   macs       772
5  WT_S60     WT   Rif1       S60         1   macs       289
6  MT_S60     MT   Rif1       S60         1   macs       371
7  WT_S90     WT   Rif1       S90         1   macs       443
8  MT_S90     MT   Rif1       S90         1   macs       517
9   WT_HU     WT   Rif1        HU         2 narrow       215
10  MT_HU     MT   Rif1        HU         2 narrow       515
11  WT_G1     WT   Rif1        G1         2 narrow       106
12  MT_G1     MT   Rif1        G1         2 narrow       171
13 WT_S60     WT   Rif1       S60         2 narrow       116
14 MT_S60     MT   Rif1       S60         2 narrow        48
15 WT_S90     WT   Rif1       S90         2 narrow       115
16 MT_S90     MT   Rif1       S90         2 narrow        57

After running the dba.count function, I'm losing all of the information for Replicate 2?

data<- dba.count(data, summits=T)

data

8 Samples, 1079 sites in matrix:
      ID Tissue Factor Treatment Replicate Caller Intervals FRiP
1  WT_HU     WT   Rif1        HU         1 counts      1079 0.11
2  MT_HU     MT   Rif1        HU         1 counts      1079 0.11
3  WT_G1     WT   Rif1        G1         1 counts      1079 0.11
4  MT_G1     MT   Rif1        G1         1 counts      1079 0.11
5 WT_S60     WT   Rif1       S60         1 counts      1079 0.11
6 MT_S60     MT   Rif1       S60         1 counts      1079 0.11
7 WT_S90     WT   Rif1       S90         1 counts      1079 0.11
8 MT_S90     MT   Rif1       S90         1 counts      1079 0.12

This isn't the case in the vignette? After dba.count using the tamoxifen data the replicate information is kept? I've tried doing this giving each of my samples different IDs (suffixed _m or _s) and I'm still seeing the same, with only replicate one being kept.

Why is this the case and should I be approaching this differently? Happy to send over any files if needed.

Thanks!

Sophie

FYI:

sessionInfo()

R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
[1] DiffBind_2.2.12            SummarizedExperiment_1.4.0 Biobase_2.34.0            
[4] GenomicRanges_1.26.4       GenomeInfoDb_1.10.3        IRanges_2.8.2             
[7] S4Vectors_0.12.2           BiocGenerics_0.20.0       

loaded via a namespace (and not attached):
 [1] Category_2.40.0          bitops_1.0-6             bit64_0.9-7             
 [4] RColorBrewer_1.1-2       tools_3.3.2              backports_1.1.0         
 [7] R6_2.2.2                 rpart_4.1-11             KernSmooth_2.23-15      
[10] Hmisc_4.0-3              DBI_0.7                  lazyeval_0.2.0          
[13] colorspace_1.3-2         nnet_7.3-12              gridExtra_2.2.1         
[16] DESeq2_1.14.1            bit_1.1-12               sendmailR_1.2-1         
[19] graph_1.52.0             htmlTable_1.9            rtracklayer_1.34.2      
[22] caTools_1.17.1           scales_0.4.1             checkmate_1.8.2         
[25] BatchJobs_1.6            genefilter_1.56.0        RBGL_1.50.0             
[28] stringr_1.2.0            digest_0.6.12            Rsamtools_1.26.2        
[31] foreign_0.8-69           AnnotationForge_1.16.1   XVector_0.14.1          
[34] htmltools_0.3.6          base64enc_0.1-3          pkgconfig_2.0.1         
[37] limma_3.30.13            htmlwidgets_0.8          rlang_0.1.1             
[40] RSQLite_2.0              BBmisc_1.11              bindr_0.1               
[43] GOstats_2.40.0           hwriter_1.3.2            BiocParallel_1.8.2      
[46] gtools_3.5.0             acepack_1.4.1            dplyr_0.7.1             
[49] RCurl_1.95-4.8           magrittr_1.5             GO.db_3.4.0             
[52] Formula_1.2-1            Matrix_1.2-10            Rcpp_0.12.11            
[55] munsell_0.4.3            stringi_1.1.5            edgeR_3.16.5            
[58] zlibbioc_1.20.0          gplots_3.0.1             fail_1.3                
[61] plyr_1.8.4               grid_3.3.2               blob_1.1.0              
[64] gdata_2.18.0             lattice_0.20-35          Biostrings_2.42.1       
[67] splines_3.3.2            GenomicFeatures_1.26.4   annotate_1.52.1         
[70] locfit_1.5-9.1           knitr_1.16               rjson_0.2.15            
[73] systemPipeR_1.8.1        geneplotter_1.52.0       biomaRt_2.30.0          
[76] XML_3.98-1.9             glue_1.1.1               ShortRead_1.32.1        
[79] latticeExtra_0.6-28      data.table_1.10.4        gtable_0.2.0            
[82] amap_0.8-14              assertthat_0.2.0         ggplot2_2.2.1           
[85] xtable_1.8-2             survival_2.41-3          tibble_1.3.3            
[88] pheatmap_1.0.8           GenomicAlignments_1.10.1 AnnotationDbi_1.36.2    
[91] memoise_1.1.0            bindrcpp_0.2             cluster_2.0.6           
[94] brew_1.0-6               GSEABase_1.36.0

diffbind • 766 views

ADD COMMENT • link updated 6.5 years ago by Rory Stark ★ 5.1k • written 6.5 years ago by Sophie Shaw • 0

score 1 · Accepted Answer · 2017-10-16

Hi Sophie-

Basically, if you don't have actual replicates you can't create them -- the multiple-peak-caller trick is only useful to capture some consensus peaks you might have otherwise missed, but once a consensus peakset is determined, we can only count overlapping reads once for each bam file. You may have see it done in the original paper with DiffBind (Ross-Innes et al. Nature 2012), where we used multiple peak callers on our breast cancer tumor samples. However this only works for the initial part of the analysis (the "occupancy analysis" in the documentation), where you are determining the consensus peakset to use for the binding matrix.

You are using the default consensus peakset, consisting of all peaks that overlap in at least two peaksets, meaning peaks that overlap in at least a) two peak calls for a single sample, or b) in peaks for two different samples. When you invoke dba.count() with this set, the reads for all samples are counted for the exact same set of (consensus) peaks. At this point, it only matters how many sets of reads (bam files) there are. It is of no value to count the reads from the exact same bam file for the exact same peak region more than once. After counting, the set of samples is reduced to the number of unique sets of reads (ie. the number of bam files), so the information about which peak caller was used is no longer informative.

Cheers-

Rory