Question

ChIPseq diffbind merging peaks

veronique.storme • 0

@veroniquestorme-12161


Dear,

I am new to differential ChIP-seq analysis. One condition has 6 biological replicates and the other has 3. I did not perform the experiment; I am only doing the analysis. I started by loading the data:

prol.10 = dba(sampleSheet="macs10_30.csv")

This resulted in a total of 7602 peaks with 47 present in at least 2 samples. This indicates that the experiment was not very reproducible. I further looked into peaks in common between biological replicates and the result was very disappointing:

dba.overlap(prol.10, prol.10$masks$elo, mode=DBA_OLAP_RATE)

resulted in 1408 0 0

and
dba.overlap(prol.10, prol.10$masks$pro, mode=DBA_OLAP_RATE)

6206 35 2 0 0 0

My question is: how is it possible that 16 of the 45 peaks remaining after peak merging are still found to be differentially bound, when no peak was found in all biological replicates? I read in another post that dba.count re-counts the overlapping reads for every consensus peak in every sample, whether or not that peak was identified in that sample. When I extract the read counts, I find 28 peaks with more than 5 reads, even though 0 overlapping peaks were found. I would have expected reads in one biological replicate and fewer than 5 reads in the other replicates. This is my code:

prol.10 = dba.count(prol.10, summits=250)

contrast.10 = dba.contrast(prol.10, categories=DBA_CONDITION)
de.10 = dba.analyze(contrast.10)

reads.10 = dba.count(prol.10, peaks=NULL, score=DBA_SCORE_READS)
bindingMatrix.10 = dba.peakset(reads.10, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)

counts.10 = bindingMatrix.10[,4:ncol(bindingMatrix.10)]

Should I change the summits option?

I would very much value your input,

Thanks, Veronique

diffbind merging_peaks • 1.8k views

7.2 years ago • veronique.storme

score 0 · Answer 1 · 2017-02-16

Hello Veronique-

The peak calling step (in this case via MACS) can be imprecise, especially in “borderline” peaks. As you use a hard threshold, some regions may just miss being identified as peaks.
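A toy illustration of the hard-threshold effect, in base R. The scores and the cutoff of 10 are invented for this sketch and do not come from MACS or from the experiment:

```r
# Invented peak-caller scores for the same region in six replicates.
scores <- c(9.8, 10.4, 9.5, 9.9, 10.1, 9.7)

# Assumed hard cutoff: a region is called a peak only at or above this score.
threshold <- 10

# Only two replicates pass the cutoff, although all six have similar signal.
called <- scores >= threshold
sum(called)
```

All six replicates carry nearly the same signal here, yet the region would count as a peak in only two of them.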

You are correct that DiffBind re-counts overlapping reads for all consensus peaks in all samples. I don’t think that the summits parameter is causing an issue here, as it only serves to re-center the enriched intervals and normalize their width. It is certainly possible to identify 16 of the 47 consensus peaks as significantly differentially bound if the variance between the biological replicates at these loci is low.

It could be useful to look at the data in a report that includes the normalized read counts:

> report.10 <- dba.report(de.10, bCounts=TRUE)

This will give you the mean normalized read count for each of the two sample groups, their fold change, and the normalized read counts for each sample. If the read concentrations are low, and/or the absolute values of the fold changes are low, you may be detecting very small changes, or changes from no binding to very low (but consistent) binding. You can see similar data graphically using dba.plotMA().
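To make the relationship between group concentrations and fold change concrete, here is a rough base R sketch on invented normalized counts. The helper `log2_conc` and the formula (log2 of the group mean) are assumptions for illustration; DiffBind's own computation may differ in detail:

```r
# Invented normalized counts for one peak: 3 samples in group A, 6 in group B.
group_a <- c(4, 5, 4)
group_b <- c(1, 1, 2, 1, 1, 1)

# Hypothetical helper: log2 of the mean count within a group.
log2_conc <- function(x) log2(mean(x))

conc_a <- log2_conc(group_a)
conc_b <- log2_conc(group_b)

# Log2 fold change of group A relative to group B.
fold <- conc_a - conc_b

c(conc_a = conc_a, conc_b = conc_b, fold = fold)
```

The counts are tiny in absolute terms, but because they are consistent within each group, a difference like this can still come out as statistically significant.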

If you like, I can take a look at the data to see if anything looks unusual. You can email me the DBA objects prol.10 before and after calling dba.count(). If the objects are more than a few MB, you can send me a link on Dropbox or something similar.

Cheers-
Rory

score 0 · Answer 2 · 2017-03-02

Hi Véronique-

I had a look at the data you sent me. As you say, there is very low agreement on the peak calling:

> dba.overlap(prol.10,mode=DBA_OLAP_RATE)
[1] 7602   47    2    0    0    0    0    0    0

As you note, this is suggestive of a non-reproducible experiment.
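For readers unfamiliar with this output: each entry of the overlap rate is the number of merged peaks found in at least 1, 2, 3, ... samples. A minimal base R sketch on a made-up presence/absence matrix (the matrix and the helper `overlap_rate` are hypothetical and not part of DiffBind):

```r
# Toy presence/absence matrix: rows are merged peaks, columns are samples.
# Values are invented for illustration only.
called <- matrix(c(1, 0, 0,
                   1, 1, 0,
                   0, 0, 1,
                   1, 1, 1,
                   0, 1, 0),
                 ncol = 3, byrow = TRUE)

# Hypothetical helper: number of peaks called in at least k samples, k = 1..n.
overlap_rate <- function(m) {
  n_called <- rowSums(m > 0)
  vapply(seq_len(ncol(m)), function(k) sum(n_called >= k), numeric(1))
}

overlap_rate(called)
```

A reproducible experiment would show the counts falling off slowly from left to right; a drop from thousands to a few dozen at the second position is what signals poor agreement.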

I think I may be able to shed some light on the discrepancies you are seeing when extracting normalized and non-normalized read counts from the binding matrix. The big difference I see is due to the reads in the Control tracks. When you use score=DBA_SCORE_READS, you are getting the counts in the ChIP track only. However, by default, dba.analyze() subtracts the counts in the Control tracks before normalizing. To see the count values that are actually being used, try:

> reads.10 <- dba.count(prol.10, peaks=NULL, score=DBA_SCORE_READS_MINUS)
> bindingMatrix.10 <- dba.peakset(reads.10, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)

You will notice that many of the values are negative, indicating that there are more reads in the Control track than there are in the ChIP track. When doing the analysis, DiffBind will set these values to 1 before normalizing, resulting in the very low normalized count values you are seeing. Note that having large signals (or even just much deeper coverage) in the Control tracks can throw off the peak callers, so this may also account for the low level of agreement in peaksets.
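The effect of control subtraction with a floor of 1 can be sketched in base R. The counts are invented and `subtract_control` is a hypothetical helper that only mirrors the behaviour described above, not DiffBind's actual code:

```r
# Toy raw counts for one consensus peak across five samples (invented values).
chip    <- c(12, 8, 30, 5, 9)
control <- c(20, 8, 4, 11, 2)

# Hypothetical helper: subtract control reads, then raise anything below 1 to 1.
subtract_control <- function(chip, control, floor = 1) {
  pmax(chip - control, floor)
}

chip - control                   # raw difference; can be zero or negative
subtract_control(chip, control)  # values below the floor are set to 1
```

Three of the five samples end up at the floor value even though their ChIP counts are between 5 and 12, which is the kind of mismatch between raw and analyzed counts described above.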

You can work with just the read counts (without subtracting the Control reads) by setting bSubControl=FALSE in the call to dba.analyze():

> de.10 <- dba.analyze(contrast.10,bSubControl=FALSE)

If you do this, you should see greater agreement in the read count values.

-Rory

score 0 · Answer 3 · 2017-03-02

Dear Rory,

Thanks a lot for your time and effort, this was really helpful,

Veronique
