In this post I have described that DiffBind gives a different number of DE genes when using dba.analyze, and when returning the results as a DESeq2 object. It seems that I have found the reason. Underlying that is DiffBind (at least by default) not using DESeq2's MLE estimation when calculating (log) fold change. So the concentration of DiffBind equals precisely log2(mean across condition), while the DESeq2 log fold change does not.
> suppressPackageStartupMessages(library(DESeq2))
> suppressPackageStartupMessages(library(DiffBind))
> data(tamoxifen_counts)
> tamoxifen <- dba.contrast(tamoxifen, design="~Tissue")
Computing results names...
> tamoxifen <- dba.contrast(tamoxifen, group1=tamoxifen$masks$MCF7, group2=tamoxifen$masks$BT474, name1="MCF7", name2="BT474")
> tamoxifen <- dba.analyze(tamoxifen,method = DBA_DESEQ2)
Normalize DESeq2 with defaults...
Analyzing...
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
converting counts to integer mode
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
>
> dba.show(tamoxifen,bContrasts = T,th=0.1)
Group Samples Group2 Samples2 DB.DESeq2
1 MCF7 5 BT474 2 954
>
>
> dds <- dba.analyze(tamoxifen,bRetrieveAnalysis = DBA_DESEQ2)
>
> res <- results(dds, independentFiltering = T, contrast=c("Tissue","MCF7","BT474"))
>
> summary(res)
out of 2845 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 738, 26%
LFC < 0 (down) : 437, 15%
outliers [1] : 0, 0%
low counts [2] : 0, 0%
(mean count < 4)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Above, we see that the number of DE genes is different. Here's why :
peak="1"
> dba.report(tamoxifen,th = 1,bNormalized = T,contrast=1,bCounts=T)[peak,]
GRanges object with 1 range and 13 metadata columns:
seqnames ranges strand | Conc Conc_MCF7 Conc_BT474 Fold p-value FDR MCF71
<Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
1 chr18 90841-91241 * | 0 0 0.88128 -0.88128 0.373987 1 0
MCF72 MCF73 MCF7r1 MCF7r2 BT4741 BT4742
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
1 0 0 0 0 2.47 1.21
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> counts(dds,normalized=T)[peak,]
BT4741 BT4742 MCF71 MCF72 MCF73 T47D1 T47D2 MCF7r1 MCF7r2
2.4692546 1.2147633 0.0000000 0.0000000 0.0000000 0.0000000 0.5461769 0.0000000 0.0000000
ZR751 ZR752
25.0813852 10.9513965
DiffBind fold change = Conc_MCF7-Conc_BT474
ConcBT474 is precisely log2(mean(BT4741,BT4742)) = log2(mean(c(2.4692546,1.2147633)))
= 0.8812801
Which means that it is estimated without using MLE, as far as I understand. DESeq2 fold change is different :
> res[peak,]
log2 fold change (MLE): Tissue MCF7 vs BT474
Wald test p-value: Tissue MCF7 vs BT474
DataFrame with 1 row and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
1 3.66027 -2.52726 1.28297 -1.96985 0.0488553 0.114492
This relationship holds for all peaks. The log2 fold change that appears in dba.report is precisely log2(mean of condition), without MLE estimation. That is different from DESeq2 fold change, and thus the results are different.
What is the solution to that? Which of the results is the correct one to use?
> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_IL.UTF-8 LC_NUMERIC=C LC_TIME=en_IL.UTF-8
[4] LC_COLLATE=en_IL.UTF-8 LC_MONETARY=en_IL.UTF-8 LC_MESSAGES=en_IL.UTF-8
[7] LC_PAPER=en_IL.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_IL.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DiffBind_3.0.13 DESeq2_1.30.1 SummarizedExperiment_1.20.0
[4] Biobase_2.50.0 MatrixGenerics_1.2.1 matrixStats_0.58.0
[7] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2 IRanges_2.24.1
[10] S4Vectors_0.28.1 BiocGenerics_0.36.0
loaded via a namespace (and not attached):
[1] GOstats_2.56.0 backports_1.2.1 BiocFileCache_1.14.0 plyr_1.8.6
[5] GSEABase_1.52.1 splines_4.0.4 BiocParallel_1.24.1 ggplot2_3.3.3
[9] amap_0.8-18 digest_0.6.27 invgamma_1.1 htmltools_0.5.1.1
[13] GO.db_3.12.1 SQUAREM_2021.1 fansi_0.4.2 magrittr_2.0.1
[17] checkmate_2.0.0 memoise_2.0.0 BSgenome_1.58.0 base64url_1.4
[21] limma_3.46.0 Biostrings_2.58.0 annotate_1.68.0 systemPipeR_1.24.3
[25] bdsmatrix_1.3-4 askpass_1.1 prettyunits_1.1.1 jpeg_0.1-8.1
[29] colorspace_2.0-0 blob_1.2.1 rappdirs_0.3.3 apeglm_1.12.0
[33] ggrepel_0.9.1 xfun_0.21 dplyr_1.0.4 crayon_1.4.1
[37] RCurl_1.98-1.2 jsonlite_1.7.2 graph_1.68.0 genefilter_1.72.1
[41] brew_1.0-6 survival_3.2-7 VariantAnnotation_1.36.0 glue_1.4.2
[45] gtable_0.3.0 zlibbioc_1.36.0 XVector_0.30.0 DelayedArray_0.16.1
[49] V8_3.4.0 Rgraphviz_2.34.0 scales_1.1.1 mvtnorm_1.1-1
[53] pheatmap_1.0.12 DBI_1.1.1 edgeR_3.32.1 Rcpp_1.0.6
[57] xtable_1.8-4 progress_1.2.2 emdbook_1.3.12 bit_4.0.4
[61] rsvg_2.1 truncnorm_1.0-8 AnnotationForge_1.32.0 httr_1.4.2
[65] gplots_3.1.1 RColorBrewer_1.1-2 ellipsis_0.3.1 pkgconfig_2.0.3
[69] XML_3.99-0.5 dbplyr_2.1.0 locfit_1.5-9.4 utf8_1.1.4
[73] tidyselect_1.1.0 rlang_0.4.10 AnnotationDbi_1.52.0 munsell_0.5.0
[77] tools_4.0.4 cachem_1.0.4 generics_0.1.0 RSQLite_2.2.3
[81] evaluate_0.14 stringr_1.4.0 fastmap_1.1.0 yaml_2.2.1
[85] knitr_1.31 bit64_4.0.5 caTools_1.18.1 purrr_0.3.4
[89] RBGL_1.66.0 xml2_1.3.2 biomaRt_2.46.3 compiler_4.0.4
[93] rstudioapi_0.13 curl_4.3 png_0.1-7 tibble_3.0.6
[97] geneplotter_1.68.0 stringi_1.5.3 GenomicFeatures_1.42.1 lattice_0.20-41
[101] Matrix_1.3-2 vctrs_0.3.6 pillar_1.5.0 lifecycle_1.0.0
[105] irlba_2.3.3 data.table_1.14.0 bitops_1.0-6 rtracklayer_1.50.0
[109] R6_2.5.0 latticeExtra_0.6-29 hwriter_1.3.2 ShortRead_1.48.0
[113] KernSmooth_2.23-18 MASS_7.3-53.1 gtools_3.8.2 assertthat_0.2.1
[117] openssl_1.4.3 Category_2.56.0 rjson_0.2.20 withr_2.4.1
[121] GenomicAlignments_1.26.0 batchtools_0.9.15 Rsamtools_2.6.0 GenomeInfoDbData_1.2.4
[125] hms_1.0.0 grid_4.0.4 DOT_0.1 coda_0.19-4
[129] rmarkdown_2.7 GreyListChIP_1.22.0 ashr_2.2-47 mixsqp_0.3-43
[133] bbmle_1.0.23.1 numDeriv_2016.8-1.1 tinytex_0.29
Hi Rory Stark ,
I wanted to discuss the "DESeq2::lfcShrink()" function mentioned in your previous explanation. While examining the source code of the latest version of Diffbind, I came across the "shrink" function, which contains the following relevant code snippets:
I have a few questions regarding this implementation. Firstly, when should I use "byname" or "results1" for the contrastType parameter? Secondly, I noticed that Diffbind employs two shrinkage methods, namely "apeglm" and "ashr." Could you please provide further explanation on this matter?
Thanks!
Joe
This is a repeat and answered here: Question regarding lfcShrink types in DiffBind