Hi, I am trying to carry out some differential peak analysis on histone modification data (H3K27Ac data).
I run into problems where one peak in particular is not being called as differentially expressed, where clearly in the data is should be.
In order to try and find out where the problem is I am trying to extract a bed file of the consensus peaks to make sure diffbind is identifying the peak as a consensus peak for the analysis.
My data is very simple. two treatments (etc, vitd) and two donors, see sample sheet print off below:
# A tibble: 4 x 11
SampleID Tissue Factor Condition Treatment Replicate bamReads ControlID bamControl
<chr> <chr> <chr> <chr> <chr> <int> <chr> <chr> <chr>
1 k27aceth1 cd4 k27ac eth a 1 Bowtie_1_27Ac_Eth.bam natinput Bowtie_1_NInput_Eth.bam
2 k27aceth2 cd4 k27ac eth a 2 Bowtie_2_27Ac_Eth.bam natinput Bowtie_1_NInput_Eth.bam
3 k27acvitd1 cd4 k27ac vitd a 1 Bowtie_1_27Ac_VitD.bam natinput Bowtie_1_NInput_Eth.bam
4 k27acvitd2 cd4 k27ac vitd a 2 Bowtie_2_27Ac_VitD.bam natinput Bowtie_1_NInput_Eth.bam
# ... with 2 more variables: Peaks <chr>, PeakCaller <chr>
The code I used for differential peak analysis is as follows:
> samples <- dba(sampleSheet=diffbind_samples, peakCaller = "bed", peakFormat = "bed")
> samples_counts <- dba.count(samples)
> samples_counts <- dba.contrast(samples_counts, categories=DBA_CONDITION, minMembers=2)
> samples_counts <- dba.analyze(samples_counts)
and then exported the results as follows:
> samples_counts.DB <- dba.report(samples_counts, th=0.1, fold=0.585)
> write.table(samples_counts.DB, "samples_counts_0.1_0.586.txt", sep="\t", row.names = F)
So I guess my main question is how to retrieve the regions that where used for the analysis, I have tried the following:
> samples_consensus <- dba.peakset(samples, consensus=c(DBA_TISSUE,DBA_CONDITION), minOverlap=1)
> samples_consensus <- dba(samples_consensus, mask=samples_consensus$masks$Consensus,minOverlap=1)
> consensus_peaks <- dba.peakset(samples_consensus, bRetrieve=TRUE)
> write.table(consensus_peaks, "consensus_peaks.txt", sep="\t")
Below is an example of what the output gave me when visualised in IGV. The top four tracks of the top section are the bigwig of the bamfiles, then in the bottom section, the two blue bars are the peaks called of the red peaks in bigwig and just below it is the "consensus_peaks" defined above and as you can see even though the peaks are called in the original analysis, it is not idnentified as a consensus peak.
Here is the sessionInfo() output:
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] readr_1.1.1 DiffBind_2.4.8 SummarizedExperiment_1.6.3
[4] DelayedArray_0.2.7 matrixStats_0.52.2 Biobase_2.36.2
[7] GenomicRanges_1.28.4 GenomeInfoDb_1.12.2 IRanges_2.10.2
[10] S4Vectors_0.14.3 BiocGenerics_0.22.0
loaded via a namespace (and not attached):
[1] Category_2.42.1 bitops_1.0-6 bit64_0.9-7 RColorBrewer_1.1-2
[5] httr_1.3.1 tools_3.4.1 backports_1.1.0 R6_2.2.2
[9] rpart_4.1-11 KernSmooth_2.23-15 Hmisc_4.0-3 DBI_0.7
[13] lazyeval_0.2.0 colorspace_1.3-2 nnet_7.3-12 gridExtra_2.2.1
[17] DESeq2_1.16.1 bit_1.1-12 compiler_3.4.1 sendmailR_1.2-1
[21] graph_1.54.0 htmlTable_1.9 plotly_4.7.1 rtracklayer_1.36.4
[25] caTools_1.17.1 scales_0.5.0 checkmate_1.8.3 BatchJobs_1.6
[29] genefilter_1.58.1 RBGL_1.52.0 stringr_1.2.0 digest_0.6.12
[33] Rsamtools_1.28.0 foreign_0.8-69 AnnotationForge_1.18.1 XVector_0.16.0
[37] base64enc_0.1-3 pkgconfig_2.0.1 htmltools_0.3.6 limma_3.32.5
[41] htmlwidgets_0.9 rlang_0.1.2 RSQLite_2.0 BBmisc_1.11
[45] GOstats_2.42.0 bindr_0.1 hwriter_1.3.2 jsonlite_1.5
[49] BiocParallel_1.10.1 gtools_3.5.0 acepack_1.4.1 dplyr_0.7.2
[53] RCurl_1.95-4.8 magrittr_1.5 GO.db_3.4.1 GenomeInfoDbData_0.99.0
[57] Formula_1.2-2 Matrix_1.2-11 Rcpp_0.12.12 munsell_0.4.3
[61] stringi_1.1.5 edgeR_3.18.1 zlibbioc_1.22.0 fail_1.3
[65] gplots_3.0.1 plyr_1.8.4 grid_3.4.1 blob_1.1.0
[69] ggrepel_0.6.5 gdata_2.18.0 lattice_0.20-35 Biostrings_2.44.2
[73] splines_3.4.1 GenomicFeatures_1.28.4 annotate_1.54.0 hms_0.3
[77] locfit_1.5-9.1 knitr_1.17 rjson_0.2.15 systemPipeR_1.10.2
[81] geneplotter_1.54.0 biomaRt_2.32.1 XML_3.98-1.9 glue_1.1.1
[85] ShortRead_1.34.0 latticeExtra_0.6-28 data.table_1.10.4 gtable_0.2.0
[89] purrr_0.2.3 tidyr_0.7.0 amap_0.8-14 assertthat_0.2.0
[93] ggplot2_2.2.1 xtable_1.8-2 survival_2.41-3 viridisLite_0.2.0
[97] pheatmap_1.0.8 tibble_1.3.4 GenomicAlignments_1.12.2 AnnotationDbi_1.38.2
[101] memoise_1.1.0 bindrcpp_0.2 cluster_2.0.6 brew_1.0-6
[105] GSEABase_1.38.1
Hi,
Could you report the output of
sessionInfo()
please? Some older versions of DiffBind had a bug that under some cases caused peaks to be reported at incorrect locations. (You should always include the output ofsessionInfo()
when you ask a question.)- Gord
Hi Gord,
Have added to original post, Sorry I will remember this in the future. Seems it is version 2.4.8?
-Reuben
Thanks, that looks current, so the old bug shouldn't be a problem. Unfortunately I can't see anything obvious. The default for
dba.count
isminOverlap=2
, so presence of a peak in two samples should be sufficient.Could you post or email your sample sheet? And the results of printing the dba objects after each of the first 4 function calls (
dba
todba.analyze
).To get a report, you use the
dba.report
function, which has a whole lot of parameters determining what you get back. In this case, you will want at least:th=1
so all sites, even non-significant ones, are reportedbCalled=TRUE
to see which peaks were called (note you need to calldba.count
withbCalledMasks=TRUE)
bCounts=TRUE
to see the counts for each peakDataType=DBA_DATA_FRAME
(probably easiest to work with)You may want to use the "file" parameter to get it all written to a file easily.
If I can't work out what's going on with all that, then I'll have to drag Rory into the discussion. He knows the internals much better than I do.
Thanks for this, I learned a bit about the dab.report,
Here is my sample sheet:
1
k27aceth1
cd4
k27ac
eth
a
1
Bowtie_1_27Ac_Eth.bam
natinput
Bowtie_1_NInput_Eth.bam
27ac.eth.don1.peaks.styleregion.250bp.mindist5kb_diffbind.bed
bed
2
k27aceth2
cd4
k27ac
eth
a
2
Bowtie_2_27Ac_Eth.bam
natinput
Bowtie_1_NInput_Eth.bam
27ac.eth.don2.peaks.styleregion.250bp.mindist5kb.diffbind.bed
bed
3
k27acvitd1
cd4
k27ac
vitd
a
1
Bowtie_1_27Ac_VitD.bam
natinput
Bowtie_1_NInput_Eth.bam
27ac.vitD.don1.peaks.styleregion.250bp.mindist5kb_diffbind.bed
bed
4
k27acvitd2
cd4
k27ac
vitd
a
2
Bowtie_2_27Ac_VitD.bam
natinput
Bowtie_1_NInput_Eth.bam
27ac.vitD.don2.peaks.styleregion.250bp.mindist5kb.diffbind.bed
bed
After doing the suggested DBA.report, I can identify the peak in question and the report shows something like this;
If I am interpreting this right the normalised values are -0,21 in etc and 8.35 in vitD which is a 8.56 fold increase from eth to vitd with an FDR of 8.52E-11. I am not quite sure how to interpret the Called1 and Called2 columns however? And the final k27aceth1 etc are the counts in each sample, which in this case show k27ac going from around 1 in both etc to over 300 in vitd treatment?
Let me know if this helps?
Again thank you for your help.
Called1
andCalled2
report the number of samples in each group that had the peak identified in the original peak calling. In this example, this peak was not identified in either of theeth
samples, but it was called in both of thevitd
samples.-R
This tells us that DiffBind is indeed finding the difference extremely significant. I'll have to leave it to Rory to understand what's happening with the output.
Ok I have an update, when I switched around my sample sheet (I did this purely as I wanted to have the vitd samples as the positive fold change values) to have the vitd samples at the top, the peak in question is now in the output of
So the exact issue above seems to be resolved.
However, I have now spotted another issue which perhaps you or Rory can help me with?
I have histone, broad-peak data, however some of the peaks are actually rather narrow whilst some are broad. So what I can see happening now is that the broader peaks are being nicely identified as differentially expressed by diffbind, but the narrow peaks which seem highly up-regulated (though big fold changes) are not reaching FDR significance.
I wonder whether this is something you have come across before? I have tried playing around with "summits" function and "bFullLibrarySize" but not sure how I can best tackle this problem? To give you an idea here is the readout for one of the peaks in question:
Chr
Start
End
Conc
Conc_vitd
Conc_eth
Fold
p-value
FDR
Called1
Called2
k27acvitd1
k27acvitd2
k27aceth1
k27aceth2
chr2
102918383
102918857
7.02
7.79
5.23
2.55
0.00444
0.513
2
1
217.33
225.1
69.3
6
Let me know if you would like me to re-post as a separate question?
Changing the order of lines in the sample sheet will change the order of the (auto-generated) comparisons (i.e. control vs treatment as opposed to treatment vs control) (please confirm or correct, Rory). But if that's changing what shows up as significant, sounds like something's wrong. Rory? Any thoughts?
As to the new question, yes, a new question would be a good idea.