I have ChIP-seq with two conditions. In condition A, there should a lot of peaks (normal binding). In condition B, there should be none (no binding). Calling peaks with MACS confirms this. When you perform differential binding analysis, you would expect most of the significant peaks to be up in condition A. However, I get half the peaks up and half down. That does not make sense.
I am using DiffBind. My first instinct was that the peaks were being normalized, so in condition B (where there are few reads in the peak regions), those few reads get normalized to a much higher level. However, that should not be the case according to the manual:
When dba.analyze is invoked using the default method=DBA_EDGER, a standardized differential analysis is performed using the edgeR package. ... First, a matrix of counts is constructed for the contrast, with columns for all the samples in the first group, followed by columns for all the samples in the second group. The raw read count is used for this matrix; if the bSubControl parameter is set to TRUE (as it is by default), the raw number of reads in the control sample (if available) will be subtracted (with a minimum final read count of 1). Next the library size is computed for each sample for use in subsequent normalization. By default, this is the total number of reads in the library (calculated from the source BAM//BED file). Alternatively, if the bFullLibrarySize parameter is set to FALSE,the total number of reads in peaks (the sum of each column) is used.
That disproves my initial theory. Is there another explanation?
If I run
dba.plotMA
with and withoutbNormalized
and see the difference, should I also rundba.report
withbNormalized=F
? Is that valid?Also, why isn't
bFullLibrarySize
sufficient? If I usebNormalized=F
, wouldn't that ignore sequencing depth differences (which are real)?The plots worked. The methods look different as would be expected. However,
bNormalized
does not seem to have any impact. I explicitly tried setting it to both T and F and the results were the same for both methods.bNormalized
is only used to get some visual sense of what the normalization is doing (check out the y-axis scales esp), it doesn't change anything in the analysis.Hi Rory, a related question. I am using diffbind for H3K36me3 in 2 control vs 2 KO. we expect to see global H3K36me3 in KO samples.
because H3K36me3 is enriched in the gene body (peak calling does not work well for those diffused marks). I use the gene body as the peak regions for all samples. I want to use DBA_DESEQ2 and bFullLibrarySize = TRUE as I saw this post. However, it seems DiffBind will merge overlapping genes into a single peak, this is not what I want. I want still a count for each gene body. How to achieve that? I thought to get the counts for each gene by using Rsubread and then feed into DESeq2, but I do not see how DESeq2 uses the bFullLibrarySize (this information is from the bam file) Thanks!
If you want to use DESeq2, you can set the scaling factor directly using the number of reads in each library:
and then set these instead of calling
estimateSizeFactors()
:Thanks! this helps me a lot.