Question

DiffBind differential binding normalization with different levels of binding

0

Entering edit mode

igor ▴ 40

@igor

Last seen 10 months ago

United States

I have ChIP-seq with two conditions. In condition A, there should a lot of peaks (normal binding). In condition B, there should be none (no binding). Calling peaks with MACS confirms this. When you perform differential binding analysis, you would expect most of the significant peaks to be up in condition A. However, I get half the peaks up and half down. That does not make sense.

I am using DiffBind. My first instinct was that the peaks were being normalized, so in condition B (where there are few reads in the peak regions), those few reads get normalized to a much higher level. However, that should not be the case according to the manual:

When dba.analyze is invoked using the default method=DBA_EDGER, a standardized differential analysis is performed using the edgeR package. ... First, a matrix of counts is constructed for the contrast, with columns for all the samples in the first group, followed by columns for all the samples in the second group. The raw read count is used for this matrix; if the bSubControl parameter is set to TRUE (as it is by default), the raw number of reads in the control sample (if available) will be subtracted (with a minimum final read count of 1). Next the library size is computed for each sample for use in subsequent normalization. By default, this is the total number of reads in the library (calculated from the source BAM//BED file). Alternatively, if the bFullLibrarySize parameter is set to FALSE,the total number of reads in peaks (the sum of each column) is used.

That disproves my initial theory. Is there another explanation?

diffbind • 4.4k views

ADD COMMENT • link updated 8.6 years ago by Rory Stark ★ 5.2k • written 8.6 years ago by igor ▴ 40

1

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 14 days ago

Cambridge, UK

The bNormalized parameter to dba.plotMA() and dba.report() has no impact in how the analysis is carried out, it just uses non-normalized counts for the plot and/or report to show some underlying data. There is currently no way with DiffBind to perform an analysis without normalizing (using a normalization method intrinsic to the underlying differential analysis package)..

The TMM normalization used in edgeR is more complicated than just taking into account sequencing depth. For example, if one sample group has a set of enriched peaks A, and the other sample group has the same enriched peaks A plus gains an additional set of enriched peaks B, and they are all sequenced to the same depth, the same number of reads have to be divided between more peaks in the second group. Without normalizing, it will appear that the A peaks will be weaker in the second group when the truth is that the binding affinity hasn't changed. So we need to normalize for more than just library depth. (Another example is differing ChIP efficiencies.)

I encourage you to read the well-written TMM paper by Robinson and Oshlack.

Have you tried the DESeq2 method? I'm curious to see how you results differ between the two methods.

-R

ADD COMMENT • link 8.6 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Switching from edgeR to DESeq2 led to a significant improvement:

edgeR - 4,200 up, 4,600 down
DESeq2 - 300 up, 13,100 down

However, since one of the conditions has no binding (at least according to MACS), is there any way to get even more of an improvement?

ADD REPLY • link 8.6 years ago igor ▴ 40

0

Entering edit mode

Thanks for posting this info, it corresponds what I am seeing on similar projects.

At this point, DiffBind will only perform analyses with full normalization. In this case, it may be a good idea to either a) do the analysis at a finer granularity by using the edgeR and/or DESeq2 packages directly, and/or b) try another package. For this problem I suggest trying the csaw package. This package does not rely on MACS peaks -- it uses a windowing approach to identify deferentially enriched regions -- and while it also uses edgeR for normalization and analysis, it has a twist on the normalization procedure (using "background" regions) that may be more appropriate. I'd be very interested to know how a csaw analysis works on this dataset.

Cheers-

Rory

ADD REPLY • link 8.6 years ago Rory Stark ★ 5.2k

0

Entering edit mode

I have not tried csaw yet. However, I had one more followup question about normalization. Switching between edgeR and DESeq2 has big effects on the differential binding results. However, it has not effect on dba.plotPCA results. I thought that was using the normalized data. I am running it after dba.contrast but before dba.analyze. On a related note, doesn't the normalization happen at the dba.analyze step? Are there multiple normalization steps?

ADD REPLY • link 8.5 years ago igor ▴ 40

0

Entering edit mode

dba.plotPCA() uses one of two sets of values depending on whether you are doing a "global" plot or a plot specific to a contrast after calling dba.analyze(). If you do not specify a contrast, it uses the read scores (which default to TMM normalized reads from edgeR). If you do specify a contrast, it uses the normalised values specific to the analysis method. Note that the default method is edgeR, so if you want to see the DESeq2 normalized scores, you need to specify contrast= and method=DBA_DESEQ2.

ADD REPLY • link 8.5 years ago Rory Stark ★ 5.2k

0

Entering edit mode

I was doing a global plot (not specifying a contrast). So is it not possible to do a global plot based on DESeq2 normalization? Can I just set th=1 to trick it to use all peaks or would that have additional consequences?

ADD REPLY • link 8.5 years ago igor ▴ 40

0

Entering edit mode

The th parameter refers to a FDR threshold, and hence only applies to an analysis (where a contrast is specified). So that doesn't help. Currently, while DiffBind can use 16 different types of score values for global plots (see man page for dba.count()), these don't include any DESeq2 normalization scores.

When I want to plot using DESeq2 normalization scores, I just make a contrast with all the samples, do a DESeq2 analysis, then plot that contrast and set th=1 as you suggest.

Adding DESeq2 normalization as a global score is probably a good feature idea for a future release.

-R

ADD REPLY • link 8.5 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Is there a way to make a contrast with all the samples without starting from a new sample sheet?

Would you advise to use score=DBA_SCORE_RPKM if I would like to avoid any normalization biases (relevant in context of my original question where normalization makes a major difference)?

ADD REPLY • link 8.5 years ago igor ▴ 40

0

Entering edit mode

RPKM is pretty good for the clustering/PCA plots. You can change the score without having to recount as follows:

> myDBA <- dba.count(myDBA, peaks=NULL, score=DBA_SCORE_RPKM)

I also use DBA_SCORE_RPKM_FOLD sometimes for these plots.

To make a contrast with all the samples you can pick a few arbitrarily for one group and all the others in the other group, or use a metadata field where with exactly two values.

Example of creating an arbitrary contrast (supposing you have six samples):

> myDBA <- dba,contrast(myDBA,group1=1:3,group2=4:6)
> myDBA <- dba.analyze(myDBA, method=DBA_DESEQ2)

Example using a metadata field. For example if Condition had values WT and KO:

> myDBA <- dba.contrast(myDBA, categories=DBA_CONDITION)
> myDBA <- dba.analyze(myDBA, method=DBA_DESEQ2)

-R

ADD REPLY • link 8.5 years ago Rory Stark ★ 5.2k

score 2 · Accepted Answer · 2015-09-22

2

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 14 days ago

Cambridge, UK

Hello Igor-

Actually the problem is, indeed, normalization. When using the edgeR package, the default TMM normalization is used. Like most normalization methods, this relies on an assumption that there is a core set of peaks that are similarly enriched in both groups.

You can see the effects of the normalization after performing an analysis by comparing the MA plots with and without normalization:

> par(mfrow=c(2,1)
> dba.plotMA(myDBA, contrast=1)
> dba.plotMA(myDBA,contrast=1, bNormalized=FALSE)

We have noticed that in the most recent releases, the issue you are encountering has gotten more pronounced in experiments where one group has little or no binding. I'm looking into this (I suspect it relates to something that changed in edgeR). We have found that in these cases the normalization used by DESeq2 is less severe and may be more appropriate in this case. You can run both analyses and compare them:

> myDBA <- dba.analyze(myDBA, method=c(DBA_EDGER, DBA_DESEQ2) > par(mfrow=c(2,2) > dba.plotMA(myDBA, contrast=1, method=DBA_EDGER, bNormalized=FALSE) > dba.plotMA(myDBA, contrast=1, method=DBA_EDGER) > dba.plotMA(myDBA, contrast=1, method=DBA_DESEQ2,bNormalized=FALSE) > dba.plotMA(myDBA, contrast=1, method=DBA_DESEQ2)

This issue is one of great interest to us and we are looking at alternatives, such as performing only the most basic normalization (for library depth) in these cases.

Cheers-
Rory

ADD COMMENT • link 8.6 years ago Rory Stark ★ 5.2k

0

Entering edit mode

If I run dba.plotMA with and without bNormalized and see the difference, should I also run dba.report with bNormalized=F? Is that valid?

Also, why isn't bFullLibrarySize sufficient? If I use bNormalized=F, wouldn't that ignore sequencing depth differences (which are real)?

ADD REPLY • link 8.6 years ago igor ▴ 40

0

Entering edit mode

The plots worked. The methods look different as would be expected. However, bNormalized does not seem to have any impact. I explicitly tried setting it to both T and F and the results were the same for both methods.

ADD REPLY • link 8.6 years ago igor ▴ 40

0

Entering edit mode

bNormalized is only used to get some visual sense of what the normalization is doing (check out the y-axis scales esp), it doesn't change anything in the analysis.

ADD REPLY • link 8.6 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Hi Rory, a related question. I am using diffbind for H3K36me3 in 2 control vs 2 KO. we expect to see global H3K36me3 in KO samples.

because H3K36me3 is enriched in the gene body (peak calling does not work well for those diffused marks). I use the gene body as the peak regions for all samples. I want to use DBA_DESEQ2 and bFullLibrarySize = TRUE as I saw this post. However, it seems DiffBind will merge overlapping genes into a single peak, this is not what I want. I want still a count for each gene body. How to achieve that? I thought to get the counts for each gene by using Rsubread and then feed into DESeq2, but I do not see how DESeq2 uses the bFullLibrarySize (this information is from the bam file) Thanks!

ADD REPLY • link 5.7 years ago tangming2005 ▴ 190

0

Entering edit mode

If you want to use DESeq2, you can set the scaling factor directly using the number of reads in each library:

> libsize = as.integer(myDBA$class[8,])

and then set these instead of calling estimateSizeFactors():

> DESeq2::sizeFactors(myDESeq2) <- libsize/min(libsize)