Question

Generating unexpected peaks from DiffBind

0

Entering edit mode

michael.alexander.finger • 0

@michaelalexanderfinger-9420

Last seen 7.6 years ago

Hello. The lab I'm working with is comparing H3K4me3 binding in two different conditions of yeast cells using ChIP-seq. I am attempting to use DiffBind with the narrowPeak files generated from MACS2, along with bam files generated from mapping, to generate a set of peaks where differential binding is occuring between the two conditions for downstream analysis. However, the set of peaks I am able to generate does not seem suitable for further analysis, as peaks are not showing up in locations where we expect to see differential binding, as evidenced by visual inspection of the original peak files, and other experimental data. These types of known differential-binding points are my primary benchmark for checking to see if my data makes sense.

My hope is that someone can help explain why I might be getting results like these, and how I might tweak DiffBind's parameters to get better results, or if I'm not understanding how DiffBind is intended to work in the first place. I am new to both bioinformatics and computational genomics with little formal guidance, so I must ask for your patience. I will briefly explain the steps taken, as well as the output.

Firstly, I have a sample sheet containing the information for the peaks files and bam files containing read counts. There are two conditions, with two replicates for each:
SampleID   Tissue   Factor   Condition   Treatment   Replicate   bamReads   ControlID   bamControl   Peaks   PeakCaller
wtK4t1   -   K4   wt   -   1   E:/data/mapped_bwa/wt_t1_K4.bam   wtK4t1ct   E:/data/mapped_bwa/wt_t1_input.bam   E:/data/peaks/wt_t1_K4_tolarge_peaks.narrowPeak   bed
wtK4t2   -   K4   wt   -   2   E:/data/mapped_bwa/wt_t2_K4.bam   wtK4t2ct   E:/data/mapped_bwa/wt_t2_input.bam   E:/data/peaks/wt_t2_K4_tolarge_peaks.narrowPeak   bed
2koK4t1   -   K4   2ko   -   1   E:/data/mapped_bwa/2ko_t1_K4.bam   2koK4t1ct   E:/data/mapped_bwa/2ko_t1_input.bam   E:/data/peaks/2ko_t1_K4_tolarge_peaks.narrowPeak   bed
2koK4t2   -   K4   2ko   -   2   E:/data/mapped_bwa/2ko_t2_K4.bam   2koK4t2ct   E:/data/mapped_bwa/2ko_t2_input.bam   E:/data/peaks/2ko_t2_K4_tolarge_peaks.narrowPeak   bed

Next, I have used the sample code shown in the DiffBind vignette to get a GRanges object containing the differential peaks:

samples<-read.csv("E:/parameters/h3k4samplesheet_K4.csv")
sk1 <- dba(sampleSheet=samples)
sk1 <- dba.count(sk1, minOverlap = 2)
sk1 <- dba.contrast(sk1, categories=DBA_CONDITION, minMembers=2)
sk1 <- dba.analyze(sk1)
sk1.DB <- dba.report(sk1)

I chose minOverlap of 2 for dba.count(), but I am unsure if this means the peak must show up in both replicates for a given condition, or two out of any of the peak files. I am also unclear on how DiffBind chooses a final consensus peak when two very similar peaks have slight differences.

Finally, I have a function that outputs the sk1.DB GRanges object as a bed file file viewing in IGV, along side the original peak files. At the bottom of this post a screenshot (with some non-relevant information removed) showing the peaks at a particular location which I will call gene A. We know from other experiments that H3K4me3 presence at gene A's location is elevated in condition 2, compared to condition one. The original peak files agree, with no apparent peak in either replicate in condition one, and nearly identical, strong peaks in both replicates of condition two. To my understanding, DiffBind's output (labeled K4Diff) should show a peak in this position representing a difference in binding, but as you can see in the image, this is not the case. The same sort of thing happens in other locations we expect to have differential binding.

Again, am I using DiffBind wrong, or am I not understanding what DiffBind is intended to output? I'm unsure of how to proceed from here, so any help is appreciated. I have viewed the DiffBind manual in addition to the vignette, but I can't seem to understand which functions/parameters might be useful to me. If any more information or clarification is needed, I will do my best to provide it.

IGV Screenshot

diffbind chipseq • 2.3k views

ADD COMMENT • link updated 8.3 years ago by Rory Stark ★ 5.1k • written 8.3 years ago by michael.alexander.finger • 0

score 1 · Answer 1 · 2015-12-31

Hello-

To answer the easy question first, setting minMembers=2 will include any peak that appears in any any two sample peaksets, regardless of whether they are represent the same or different conditions.

As for the reason why your are not seeing peaks identified as differentially bound where you expect them to be, there are a number of possible reasons for this, and some steps to cantata to understand better what may be going on.

Firstly, as you only have two replicates of each condition, the experiment is almost certainly underpowered, such that you will not be able to detect many changes with high confidence. That may or my not the be issue, but it is something to keep in mind.

The first thing I would do is get more detailed reports to examine. If you get a report as follows:

sk1.DB <- dba.report(sk1, th=1, bCounts=TRUE)

you will get back a GRanges object with all the peaks (as the FDR threshold is set to 1), and also with all the normalized counts for each sample. Using this, you can go look for specific regions, such as the one you have prior information about, and see how the counts looks for the four samples.

Looking at the genome browser show you included, even at the peak calling level, the only place I see a peak called in both the 2ko samples but not the wt samples is the rightmost peak (just past 390kb). It looks like they are called more weakly, and at different widths, as well; this may be too variable to call with two replicates.

I would suggest trying two things:

First, re-center the peaks around the summits to get narrower peaks, which may focus the analysis on less variable regions. All you need to do is to set the summits parameter in dba.count():

sk1 <- dba.count(sk1, minOverlap = 2, summits=200)

This will give you peaks of uniform 400bp width.

Second, try running both edgeR and DESeq2 analyses. The main difference you may see is in the normalization. You mention that you expect a gain of binding in one condition in a certain area; if this is very widespread, the TMM method in edgeR may not be appropriate. You can run these two analyses together as follows:

sk1 <- dba.analyze(sk1, method=c(DBA_EDGER,DBA_DESEQ2))

and then you can get a report for each:

sk1.DB.edgeR <- dba.report(sk1, th=1, bCounts=TRUE, method=DBA_EDGER) sk1.DB.DESeq2 <- dba.report(sk1, th=1, bCounts=TRUE, method=DBA_DESEQ2)

I would check MA plots as well.

Hope this helps-

Rory