Question: DiffBind differential binding normalization with different levels of binding
Asked 2.7 years ago by igor (United States):

I have ChIP-seq with two conditions. In condition A, there should be a lot of peaks (normal binding). In condition B, there should be none (no binding). Calling peaks with MACS confirms this. When performing differential binding analysis, you would expect most of the significant peaks to be up in condition A. However, I get half the peaks up and half down, which does not make sense.

I am using DiffBind. My first instinct was that the peaks were being normalized, so in condition B (where there are few reads in the peak regions), those few reads get normalized to a much higher level. However, that should not be the case according to the manual:

When dba.analyze is invoked using the default method=DBA_EDGER, a standardized differential analysis is performed using the edgeR package. ... First, a matrix of counts is constructed for the contrast, with columns for all the samples in the first group, followed by columns for all the samples in the second group. The raw read count is used for this matrix; if the bSubControl parameter is set to TRUE (as it is by default), the raw number of reads in the control sample (if available) will be subtracted (with a minimum final read count of 1). Next the library size is computed for each sample for use in subsequent normalization. By default, this is the total number of reads in the library (calculated from the source BAM/BED file). Alternatively, if the bFullLibrarySize parameter is set to FALSE, the total number of reads in peaks (the sum of each column) is used.
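For context, the choice the manual describes is made via the bFullLibrarySize argument to dba.analyze(); a sketch of how it would be set (exact defaults depend on the DiffBind release, so check your version's man page):

> # Normalize by reads in peaks rather than full library size (sketch)
> myDBA <- dba.analyze(myDBA, method=DBA_EDGER, bFullLibrarySize=FALSE)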

That disproves my initial theory. Is there another explanation?

modified 2.7 years ago by Rory Stark • written 2.7 years ago by igor
Answer (2.7 years ago) by Rory Stark (CRUK, Cambridge, UK):

Hello Igor-

Actually the problem is, indeed, normalization. When using the edgeR package, the default TMM normalization is used. Like most normalization methods, this relies on an assumption that there is a core set of peaks that are similarly enriched in both groups.

You can see the effects of the normalization after performing an analysis by comparing the MA plots with and without normalization:

> par(mfrow=c(2,1))
> dba.plotMA(myDBA, contrast=1)
> dba.plotMA(myDBA,contrast=1, bNormalized=FALSE)

We have noticed that in the most recent releases, the issue you are encountering has gotten more pronounced in experiments where one group has little or no binding. I'm looking into this (I suspect it relates to something that changed in edgeR). We have found that in these cases the normalization used by DESeq2 is less severe and may be more appropriate in this case. You can run both analyses and compare them:

> myDBA <- dba.analyze(myDBA, method=c(DBA_EDGER, DBA_DESEQ2))
> par(mfrow=c(2,2))
> dba.plotMA(myDBA, contrast=1, method=DBA_EDGER, bNormalized=FALSE)
> dba.plotMA(myDBA, contrast=1, method=DBA_EDGER)
> dba.plotMA(myDBA, contrast=1, method=DBA_DESEQ2, bNormalized=FALSE)
> dba.plotMA(myDBA, contrast=1, method=DBA_DESEQ2)

This issue is one of great interest to us and we are looking at alternatives, such as performing only the most basic normalization (for library depth) in these cases.

Cheers-
Rory

If I run dba.plotMA with and without bNormalized and see the difference, should I also run dba.report with bNormalized=F? Is that valid?

Also, why isn't bFullLibrarySize sufficient? If I use bNormalized=F, wouldn't that ignore sequencing depth differences (which are real)?

The plots worked. The methods look different as would be expected. However, bNormalized does not seem to have any impact. I explicitly tried setting it to both T and F and the results were the same for both methods.

bNormalized is only used to get a visual sense of what the normalization is doing (note especially the y-axis scales); it doesn't change anything in the analysis.

Answer (2.7 years ago) by Rory Stark (CRUK, Cambridge, UK):

The bNormalized parameter to dba.plotMA() and dba.report() has no impact on how the analysis is carried out; it just uses non-normalized counts for the plot and/or report to show some of the underlying data. There is currently no way in DiffBind to perform an analysis without normalizing (using a normalization method intrinsic to the underlying differential analysis package).

The TMM normalization used in edgeR is more complicated than simply accounting for sequencing depth. For example, if one sample group has a set of enriched peaks A, and the other sample group has the same enriched peaks A plus an additional set of enriched peaks B, and both are sequenced to the same depth, the same number of reads must be divided among more peaks in the second group. Without normalizing, the A peaks will appear weaker in the second group when in truth the binding affinity hasn't changed. So we need to normalize for more than just library depth. (Another example is differing ChIP efficiencies.)
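To make that concrete, here is a toy sketch using edgeR's calcNormFactors() on invented counts: two libraries of equal depth, where the second sample spreads its reads over twice as many peaks, so the shared peaks look half as strong before TMM correction:

> library(edgeR)
> # Hypothetical counts: peaks 1-4 are shared; sample s2 gains peaks 5-8,
> # so at equal total depth (400 reads) its shared peaks get half the reads.
> counts <- cbind(s1=c(100,100,100,100,0,0,0,0),
+                 s2=c(50,50,50,50,50,50,50,50))
> colSums(counts)          # both libraries have 400 reads
> calcNormFactors(counts)  # TMM factors compensate for the apparent drop

Plain library-size scaling would leave both factors at 1 and make every shared peak look differential.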

I encourage you to read the well-written TMM paper by Robinson and Oshlack.

Have you tried the DESeq2 method? I'm curious to see how your results differ between the two methods.

-R

Switching from edgeR to DESeq2 led to a significant improvement:

• edgeR - 4,200 up, 4,600 down
• DESeq2 - 300 up, 13,100 down

However, since one of the conditions has no binding (at least according to MACS), is there any way to get even more of an improvement?

Thanks for posting this info; it corresponds to what I am seeing on similar projects.

At this point, DiffBind will only perform analyses with full normalization. In this case, it may be a good idea to either a) do the analysis at a finer granularity by using the edgeR and/or DESeq2 packages directly, and/or b) try another package. For this problem I suggest trying the csaw package. It does not rely on MACS peaks -- it uses a windowing approach to identify differentially enriched regions -- and while it also uses edgeR for normalization and analysis, it has a twist on the normalization procedure (using "background" regions) that may be more appropriate here. I'd be very interested to know how a csaw analysis works on this dataset.
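A minimal sketch of csaw's background-based normalization idea (file names hypothetical, and the exact function names have shifted between csaw releases, so consult its user guide):

> library(csaw)
> bams <- c("A1.bam", "A2.bam", "B1.bam", "B2.bam")  # hypothetical BAM files
> win  <- windowCounts(bams, width=150)               # counts in sliding windows
> bins <- windowCounts(bams, bin=TRUE, width=10000)   # large background bins
> win  <- normFactors(bins, se.out=win)               # TMM computed on background bins

Because the factors come from large background bins rather than enriched regions, they are not skewed when one condition has little or no binding in the peaks.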

Cheers-

Rory

I have not tried csaw yet. However, I had one more follow-up question about normalization. Switching between edgeR and DESeq2 has big effects on the differential binding results, but it has no effect on the dba.plotPCA results. I thought that plot used the normalized data. I am running it after dba.contrast but before dba.analyze. On a related note, doesn't the normalization happen at the dba.analyze step? Are there multiple normalization steps?


dba.plotPCA() uses one of two sets of values depending on whether you are doing a "global" plot or a plot specific to a contrast after calling dba.analyze(). If you do not specify a contrast, it uses the read scores (which default to TMM normalized reads from edgeR). If you do specify a contrast, it uses the normalised values specific to the analysis method. Note that the default method is edgeR, so if you want to see the DESeq2 normalized scores, you need to specify contrast= and method=DBA_DESEQ2.
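Assuming a completed DESeq2 analysis for contrast 1, the two modes would be invoked like this (a sketch):

> dba.plotPCA(myDBA)                                 # global: uses the read scores
> dba.plotPCA(myDBA, contrast=1, method=DBA_DESEQ2)  # uses DESeq2-normalized values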

I was doing a global plot (not specifying a contrast). So is it not possible to do a global plot based on DESeq2 normalization? Can I just set th=1 to trick it into using all peaks, or would that have additional consequences?

The th parameter refers to a FDR threshold, and hence only applies to an analysis (where a contrast is specified). So that doesn't help. Currently, while DiffBind can use 16 different types of score values for global plots (see man page for dba.count()), these don't include any DESeq2 normalization scores.

When I want to plot using DESeq2 normalization scores, I just make a contrast with all the samples, do a DESeq2 analysis, then plot that contrast and set th=1 as you suggest.

Adding DESeq2 normalization as a global score is probably a good feature idea for a future release.

-R

Is there a way to make a contrast with all the samples without starting from a new sample sheet?

Would you advise to use score=DBA_SCORE_RPKM if I would like to avoid any normalization biases (relevant in context of my original question where normalization makes a major difference)?


RPKM is pretty good for the clustering/PCA plots. You can change the score without having to recount as follows:

> myDBA <- dba.count(myDBA, peaks=NULL, score=DBA_SCORE_RPKM)

I also use DBA_SCORE_RPKM_FOLD sometimes for these plots.

To make a contrast with all the samples, you can pick a few arbitrarily for one group and put all the others in the other group, or use a metadata field with exactly two values.

Example of creating an arbitrary contrast (supposing you have six samples):

> myDBA <- dba.contrast(myDBA, group1=1:3, group2=4:6)
> myDBA <- dba.analyze(myDBA, method=DBA_DESEQ2)

Example using a metadata field, e.g. if Condition has the values WT and KO:

> myDBA <- dba.contrast(myDBA, categories=DBA_CONDITION)
> myDBA <- dba.analyze(myDBA, method=DBA_DESEQ2)

-R