Question

Diffbind counts vs. MACS2 scores

mm2489 ▴ 20

Hi Rory,

I have a question I was hoping you could clarify for me.

In my ChIP-seq experiment, I have 2 conditions (2 biological replicates each). Whenever I visualize the peaks in IGV or plot the scores from the MACS2 output BED files in R, I see that, overall, the signal is much lower in one condition than in the other.

However, whenever I compare counts from the consensus peakset generated by DiffBind, the dramatic difference is gone.

I don't exactly understand the source of discrepancy between IGV peak intensities and DiffBind counts.

I would really appreciate your help.

diffbind • 3.8k views

ADD COMMENT • link 10.8 years ago mm2489 ▴ 20

score 1 · Answer 1 · 2015-05-08

The answer could have to do with how the data have been normalized.

There are a number of ways you can verify that the counts are working the way you expect, and what the normalization is doing.

To see the raw read counts, instead of normalized scores, you can set the score to DBA_SCORE_READS. You can switch between scores without having to recount:

> DBA <- dba.count(DBA, peaks=NULL, score=DBA_SCORE_READS)

Then when you retrieve the binding matrix with dba.peakset() (using bRetrieve=TRUE), you'll have raw reads, and you can see if these agree with what you see in IGV.

If you've done an analysis, even if it doesn't turn up any differentially bound sites, you can see even more useful stuff. For example,

> par(mfrow=c(1,2))
> dba.plotBox(DBA, contrast=1, bAll=TRUE, bDB=FALSE, bDBIncreased=FALSE, bDBDecreased=FALSE, bNormalized=FALSE)
> dba.plotBox(DBA, contrast=1, bAll=TRUE, bDB=FALSE, bDBIncreased=FALSE, bDBDecreased=FALSE, bNormalized=TRUE)

will give you boxplots of the counts. If you see a big difference in the first one, but they look even in the second one, that means the normalization is responsible.

I always look at MA plots, one normalized and one not, to see the effect of the normalization:

> par(mfrow=c(1,2))
> dba.plotMA(DBA, bNormalized=FALSE)
> dba.plotMA(DBA, bNormalized=TRUE)

If the most dense part of the first plot is above or below the line, but on the line in the second one, again, this is the normalization.
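To make the "above or below the line" idea concrete, here is a rough base-R sketch of what an MA plot measures. It does not use DiffBind at all: the counts are simulated, the names `cond1`, `cond2` and `ma_values` are made up for illustration, and simple total-count scaling stands in for a real normalization method.

```r
# Toy MA values in base R (hypothetical counts, not DiffBind output)
set.seed(42)
n_peaks <- 500
cond1 <- rpois(n_peaks, lambda = 300) + 1   # +1 avoids log2(0)
cond2 <- rpois(n_peaks, lambda = 100) + 1   # globally about 3x lower

# M is the log fold change, A is the mean log signal
ma_values <- function(x, y) {
  data.frame(A = (log2(x) + log2(y)) / 2,
             M = log2(x) - log2(y))
}

raw_ma <- ma_values(cond1, cond2)

# Assumption: total-count scaling as a stand-in for a real normalization
scaled2 <- cond2 * sum(cond1) / sum(cond2)
norm_ma <- ma_values(cond1, scaled2)

# The raw cloud sits well above M = 0; the scaled cloud is centred on it
print(c(raw_median_M = median(raw_ma$M), norm_median_M = median(norm_ma$M)))
```

If your real plots behave like this pair, with the dense part shifted off the line before normalization and centred afterwards, the normalization step is what is hiding the difference.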

You can also retrieve raw and normalized scores in the report:

> reportRaw  <- dba.report(DBA, contrast=1, th=1, bNormalized=FALSE)
> reportNorm <- dba.report(DBA, contrast=1, th=1, bNormalized=TRUE)

If the issue is not normalization, let me know and we can troubleshoot. If it is, then you have to think carefully about whether the normalization is removing technical variance or whether it is removing "real biological" variance. Most normalization methods assume to some degree that the bulk of the signal should be similar between the two groups. If you have a case where there is very little binding in one group and a lot more binding in the other, this poses a challenge for normalization. The TMM method that DiffBind uses by default (from edgeR) can handle this to a certain extent, but if the difference is too extreme it will end up equalising things that should remain different. This is a problem for all these read-based sequencing experiments, including RNA-seq. I recommend reading the TMM paper by Robinson and Oshlack: http://genomebiology.com/2010/11/3/r25
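As a rough illustration of why the "bulk of the signal is similar" assumption matters, the base-R sketch below compares two simulated scenarios. This is not TMM and not DiffBind: it uses plain total-count scaling, made-up Poisson counts, and a hypothetical helper `ratio_after_scaling`, so treat it only as a picture of the general effect.

```r
set.seed(1)
n_peaks <- 1000

# Scale two count vectors to a common total, then return the median A/B
# ratio over a chosen subset of peaks (simple stand-in, NOT TMM).
ratio_after_scaling <- function(a, b, idx) {
  target <- mean(c(sum(a), sum(b)))
  a_norm <- a * target / sum(a)
  b_norm <- b * target / sum(b)
  median(a_norm[idx]) / median(b_norm[idx])
}

a <- rpois(n_peaks, 200)
changed <- 1:100

# Scenario 1: only the first 100 peaks lose binding in condition B
b_few <- rpois(n_peaks, ifelse(seq_len(n_peaks) %in% changed, 50, 200))
few <- ratio_after_scaling(a, b_few, changed)

# Scenario 2: every peak loses binding in condition B
b_all <- rpois(n_peaks, 50)
all_changed <- ratio_after_scaling(a, b_all, changed)

# Scenario 1 keeps a large ratio at the changed peaks; scenario 2 does not
print(c(few_peaks_changed = few, all_peaks_changed = all_changed))
```

When only a minority of peaks change, the scaling leaves their fold change largely intact. When everything changes in the same direction, the same scaling pulls the ratio back towards 1, which is the situation described in the question.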

Cheers-

Rory