In the process of analyzing ChIP-Seq replicates with DiffBind (drosophila data, 3 replicates for each sample, choosing DESeq2 as the analysis method), peaks with very small Fold-change were found to be statistically significant.
I have plotted an MA plot; the red points are the significant ones, the black points are the non-significant ones. I have plotted the non-significant points after plotting the red points, so there was peak with a large fold that would be non-significant, it would be easily visible.
Zooming in, ylim = c(-2,2)
I wonder, does the statistical significace make sense? I am not used to such plots when using DESeq2 with RNA-Seq data. It is true that the number of elements to correct for multiple hypothesis is much smaller here (2000 peaks vs ~16K genes in DESeq2 usually). Nevertheless, I'd like to ask whether you agree that it's surprising/strange that lots of peaks with a very small change became significant. There are no non-significant peaks above a certain, small threshold.
<h2>Note: The values DiffBind returns, and are plotted, are shrunk fold changes.</h2>More on the data:
The median absolute value of the fold change among the significant peaks is 1.55 and the mean is 1.93 (not in log scale). The Concentration is also not that high, the median among the significant values being 225 reads.
The peak which has the smallest fold change, and is still significant is (Fold and Conc are in log2 scale here):
seqnames start end width strand
3L 11657351 11659235 1885 *
Conc Conc_Mut Conc_WT Fold p.value FDR
8.48 8.63 8.3 0.23 0.00655 0.0376
Individual counts (not in log2 scale):
S7 S8 S9 S1 S2 S3
404.42 391 396.29 305.76 317.14 321.53
P.S. I have posted this also on BioStars, but the answers posted there had to do with biological significance of the peaks with small fold changes. I am suspicious about the statistical significance of the peaks.
Looking at the individual counts, I noticed that S1-S3 are rather tightly spread around 315, and S7-9 around 395. These are indeed the different groups. So, I think that the answer is simple - the repeatability of the samples is rather high. Sorry for the hassle.
From Version 3.3.2 onwards,
DiffBind
will recompute the statistics if you specify a fold threshold indba.report()
,dba.plotMA()
, ordba.plotVolcano()
.For
DESeq2
analysis, it uses thelfcThreshold
parameter inDESeq2::results()
(re-shrinking fold changes), while foredgeR
analysis,edgeR::glmTreat()
(instead ofedgeR::glmQLFTest()
).