Question

DiffBind - question on the use of filtered or unfiltered input data

0

Entering edit mode

Oscar • 0

@c597e8d8

Last seen 18 months ago

Spain

Currently I'm running DiffBind using bam/bed files that have been filtered to keep only the highly reproducible peaks (HR-peaks), obtained using the IDR method (for reference: https://hbctraining.github.io/Intro-to-ChIPseq/lessons/07_handling-replicates-idr.html), obtaining a good amount of DB peaks.

However, in a discussion with my colleagues a point was raised mentioning that the input for DiffBind should be all peaks on the dataset and not only the HR ones, resulting in way less DB peaks for some of our marks/conditions when using all the peaks (and not only the HR ones). I find this a little bit conflicting with what I know of DeSeq2 (which, if I understood it correctly, Diffbind uses some routines carried out by DeSeq2 too), so I was wondering if you could clarify this question? I assumed that I could use the HR peaks from the line below (found in the DiffBind vignette, page 4): (..) Peaksets are derived either from ChIP-Seq peak callers, such as MACS, or using some other criterion (...)

I'm running DiffBind with mostly default parameters but just in case here is the code:

peaksets <- dba(sampleSheet = samples)
readcount <- dba.count(peaksets)
normalized <- dba.normalize(readcount)
comparisson <- dba.contrast(normalized, contrast = c("Factor","challenge","control"), minMembers = 2)
diffanalysis <- dba.analyze(comparisson, bParallel = FALSE, bBlacklist = FALSE, bGreylist = FALSE)

Thank you so much for your time!

Óscar

DiffBind • 1.6k views

ADD COMMENT • link updated 18 months ago by householdelsewhere • 0 • written 19 months ago by Oscar • 0

0

Entering edit mode

This seems at odds with what I understand about DeSeq2 (and, if I'm not mistaken, Diffbind also makes use of certain DeSeq2–carried–out algorithms), so I was hoping you might shed some light on the matter. I had first thought that the HR peaks from the following line (page 4 of the DiffBind vignette) may be used: (..) In ChIP-Seq, peak callers like MACS are used to identify peaks, and these are then used as one of several criteria to generate peaksets. drift hunters

ADD REPLY • link 18 months ago householdelsewhere • 0

score 0 · Answer 1 · 2023-04-20

0

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 28 days ago

Cambridge, UK

Yes, you can certainly use a consensus peakset derived in this manner using IDR. (I assume you are calling peaks=HR-peaks when you call dba.count(), or are you setting this as the peaksets for all the samples in the samplesheet? either way works).

I wouldn't necessarily expect "way less" peaks using a larger consensus set, but it can happen. Do you know the difference in peak numbers between the all-in consensus and the HR-peaks? If the include-all-peaks option involves a much larger number of much lower quality peaks, I could see how many of the highly reproducible differentially bound sites could end up below the FDR threshold due to multiple testing correction.

I suspect that your approach of using IDR-reived HR peaks is a good one; an alternative would be to look at the decay of the overlap curve and set minOverlap to a larger number.

ADD COMMENT • link 19 months ago Rory Stark ★ 5.2k

0

Entering edit mode

Thanks Rory. The way I'm doing this is, after obtaining the HR-peaks, I filter the peak bed files to keep only the peaks that overlap with my HR-list. Those filtered ones are the files I refer to in the samplesheet Peaks column. I leave the bam files (bamReads column) unfiltered.

Bellow the summary of the number of DiffBinded peaks I get for 3 marks, and 4 comparisons using all the peaks

All peaks   ATAC    H3K4me3 H3K27ac
Comparison_A    0   3   1
Comparison_B    20992   12749   158
Comparison_C    31  1009    55
Comparison_D    31341   11483   0

and now using the HR peaks only as described above

HR peaks    ATAC    H3K4me3 H3K27ac
Comparison_A    21352   3979    1797
Comparison_B    48558   6271    5188
Comparison_C    45255   4072    13374
Comparison_D    36815   3598    12611

ADD REPLY • link 19 months ago Oscar • 0

0

Entering edit mode

It would help to see more of your script and intermediate results. How are you specifying the consensus peakset (are you using minOverlap=1)? Have you looked at the overlap rate to see how well the peak calls agree? How many peaks are in the keep-all consensus peakset? How are you setting up the contrasts? How are you normalizing? What do the MA plots look like, raw vs normalized, HR-list vs keep-all?

The IDR method should be fine. If you really want to get to the bottom of why the results are so different using a broader set of consensus peaks, it would take some digging.

ADD REPLY • link 19 months ago Rory Stark ★ 5.2k

0

Entering edit mode

1) Regarding the consensus peakset, I am leaving the minOverlap=2 as default, haven't changed it to 1, should I?.

2) In my keep-all set the number of consensus peaks is as follows:

ATAC: 50702
H3K4me3: 14377
H3K27ac: 5904

3) For my contrasts, I'm doing:

comparisson <- dba.contrast(normalized, contrast = c("Factor","challenge","control"), minMembers = 2)

4) Normalization also not specified, by default (normalize=DBA_NORM_DEFAULT)

5) Now, for the MA plots, these are the ones I'm getting for one of the comparisons. For normalized HR-list: HR_peaks, normalized

and for the normalized keep-all list: keep_all peaks, normalized

For the HR-peaks MAplot skipping normalization I am getting the following error while establishing the contrast:

Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  : 
  every gene contains at least one zero, cannot compute log geometric means

I have yet to generate the unormalized keep_all one, it might take a while (I haven't recorded all dba objects, so sorry!

ADD REPLY • link 19 months ago Oscar • 0