Hi,
One of the many changes brought by DiffBind version 3.0 is the possibility of setting a model design in the same way one does in DESEq2. However, when I compare the results of this new approach ("design") with those of the pre-version 3.0 analysis ("no-design"), I see some differences. I used the following commands:
diffobject <- dba(sampleSheet=samplesheet,minOverlap=2, bRemoveRandom=T,scoreCol=9,filter=10,bRemoveRandom=T)
diffobject$config$doBlacklist <- F
diffobject$config$doGreylist <- F
diffobject <- dba.count(diffobject,bRemoveDuplicates=T,minCount=0,minOverlap=2, score=DBA_SCORE_NORMALIZED,mapQCth = 20,bParallel = T)
diffobject <- dba.normalize(diffobject,normalize=DBA_NORM_DEFAULT, library=DBA_LIBSIZE_FULL,background = F)
I tried to normalize using the same settings as in the pre-3.00 version according to the vignette. This is also the reason I set greylist and blacklist to FALSE
and why I set bSubControl to T. Next, I defined the contrasts as follows:
# Design approach (3.X)
diffobject.3 <- dba.contrast(diffobject,design="~Condition", contrast=c("Condition","TREAT","CONTROL"),reorderMeta = list(Condition="CONTROL"))
diffobject.3 <- dba.analyze(diffobject.3,design=T)
# No-design approach (pre-3.00)
diffobject.2 <- dba.contrast(diffobject,categories="Condition",group1=dba.mask(diffobject,"Condition","TREAT"),
group2=dba.mask(diffobject,"Condition","CONTROL"),name1="TREAT",name2="CONTROL")
diffobject.2 <- dba.normalize(diffobject.2,normalize=DBA_NORM_DEFAULT, library=DBA_LIBSIZE_FULL,background = F)
diffobject.2 <- dba.analyze(diffobject.2)
These are the contrasts with each approach:
# diffbind.2 (no-design)
1 Contrast:
Group Samples Group2 Samples2 DB.DESeq2
1 TREAT 4 CONTROL 3 11496
# diffbind.3 (design)
Design: [~Condition] | 1 Contrast:
Factor Group Samples Group2 Samples2 DB.DESeq2
1 Condition TREAT 4 CONTROL 3 12283
Of note, when I ran dba.analyze
on diffobject.3
with design=F
I got the exact same results as in diffobject.2
, in which groups had been set manually in dba.contrast
.
These are the main differences between the design vs no-design approaches, an example follows:
- Different number of regions with differential binding: 11496 vs 12283
- Different p-values for the same regions, probably related to the above
- Differences in fold change: In the no-design approach I obtain fold changes that are the result of subtracting group 1 - group 2, whereas in the design approach I get numbers that are a few decimals higher or lower. In the example below, you would expect a Fold of -9.12193 ( 2.28187-11.4038), but only the analysis with the "no-design" approach returns that.
# No design
dba.report(diffobject.3)
seqnames ranges strand | Conc Conc_TREAT Conc_CONTROL Fold p-value FDR
<Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
48116 chr7 27195628-27196028 * | 10.18484 2.28187 11.4038 -8.91646 1.20932e-101 6.67834e-97
# Design
dba.report(diffobject.2)
seqnames ranges strand | Conc Conc_TREAT Conc_CONTROL Fold p-value FDR
<Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
48116 chr7 27195628-27196028 * | 10.18484 2.28187 11.4038 -9.12192 1.83179e-101 1.01095e-96
Moreover, in the "design" approach there are some extreme cases with a Fold change that is not only very different from (Conc_1 - Conc_2), but in fact extremely large. As a result, the MA plot looks "flattened" due to the inclusion of these extreme values.
I have also run DESeq2 separately, after extracting the raw counts from the DiffBind object using dba.count(diffobject,peaks=NULL,score=DBA_SCORE_READS_MINUS)
. The results look quite similar to the "design" approach, including those extreme values I was talking about. Skimming through the DiffBind code, what I believe is going on here is that in the "design" approach the fold change is taken directly from DESeq2, whereas in the "no-design" mode, the fold is always calculated as Conc_1 - Conc_2. Since Conc values < 0 are converted to 0, that avoids extreme differences.
Still, that does not explain why there are also differences in the p-values and the total number of differential regions. Are there any parameters that I should adjust differently? It have been using the "no design" approach for a while and I would like new results to be comparable. Furthermore, it would be ideal if Fold change was presented in the same manner in both cases.
P.S. I have also noticed that the vignette states that bSubControl
is FALSE by default unless a greylist has been specified, but the dba.count
states the opposite. Looking at the function, it looks like the vignette is wrong: bSubControl = is.null(DBA$greylist)
.