Question: DESeq2 overestimates log2Fold Change
gravatar for marcomhefti
3 months ago by
marcomhefti0 wrote:

I have been using DESeq2 to analyze a data set (which I can provide if you let me know how). There are three groups control (n=3), and two experimentals (n=6 and n=5 respectively). I am specifically interested in a subset of mitochondrial genes, so I extracted those genes from the output table, and looked at the fold changes. All of them are markedly higher than what I calculate by hand using average/average. I did the significance testing two ways, one was to use the built in p adjustment, and the other was to manual adjust raw pvalues for the total number of mitochondrial genes (13). I have read the vignette sections on the way DESeq2 handles fold change, but that would decrease fold change, not increase it as in this case. I also checked which has a similar issue, but they have genes with zero counts, whereas none of mine are zero. The pvalues adjusted for all genes and manually adjusted for mitochondrial genes only are shown, as are the log2FC calculated by DESeq2 and that calculated manually by me.

   gene_name    log2FC(DESeq)   log2FC(manual)  padj(all)   padj(mito)
1   ATP6    1.1478026   0.53888743  0.12025253  0.030016423
2   ATP8    0.1296832   -0.25716892 0.99980974  0.002090115
3   COX1    0.6343189   0.0279667   0.82381609  0.602135571
4   COX2    0.9746684   0.27872888  0.24509383  0.04293689
5   COX3    1.0013343   0.04216672  0.23750685  1
6   CYTB    1.0769144   0.14293637  0.18836461  0.012845383
7   ND1 1.0584515   0.37491184  0.20257478  0.039408231
8   ND2 1.3333982   0.74555573  0.03517281  0.137780616
9   ND3 0.8630347   0.67474684  0.45660328  0.034468894
10  ND4 1.3008298   1.24767717  0.04523489  0.002897503
11  ND4L    1.0080011   0.69944278  0.22110758  0.005564325
12  ND5 1.2066828   0.99380552  0.06801963  0.000303793
13  ND6 1.3248817   0.87680785  0.01009348  0.026041558

I am happy to provide the full data set and full code if desired but here are the relevant sections

dds <- DESeqDataSetFromMatrix(countData = cts,colData = Groups,design = formula)

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] ggplot2_3.0.0              gridExtra_2.3
 [3] stringr_1.3.1              reshape2_1.4.3
 [5] DESeq2_1.14.1              SummarizedExperiment_1.4.0
 [7] Biobase_2.34.0             GenomicRanges_1.26.4
 [9] GenomeInfoDb_1.10.3        IRanges_2.8.2
[11] S4Vectors_0.12.2           BiocGenerics_0.20.0

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.1       Rcpp_0.12.15         lattice_0.20-34
 [4] assertthat_0.2.0     digest_0.6.15        R6_2.3.0
 [7] plyr_1.8.4           backports_1.1.2      acepack_1.4.1
[10] RSQLite_2.1.1        pillar_1.3.0         zlibbioc_1.20.0
[13] rlang_0.2.0          lazyeval_0.2.1       rstudioapi_0.8
[16] data.table_1.11.8    annotate_1.52.1      blob_1.1.1
[19] rpart_4.1-10         Matrix_1.2-7.1       checkmate_1.8.5
[22] splines_3.3.2        BiocParallel_1.8.2   geneplotter_1.52.0
[25] foreign_0.8-71       htmlwidgets_1.3      bit_1.1-14
[28] RCurl_1.95-4.11      munsell_0.5.0        pkgconfig_2.0.2
[31] base64enc_0.1-3      tcltk_3.3.2          htmltools_0.3.6
[34] nnet_7.3-12          tibble_1.4.2         htmlTable_1.12
[37] Hmisc_4.1-1          XML_3.98-1.16        withr_2.1.2
[40] crayon_1.3.4         dplyr_0.7.4          bitops_1.0-6
[43] grid_3.3.2           xtable_1.8-3         gtable_0.2.0
[46] DBI_1.0.0            magrittr_1.5         scales_1.0.0
[49] stringi_1.2.4        XVector_0.14.1       genefilter_1.56.0
[52] bindrcpp_0.2         latticeExtra_0.6-28  Formula_1.2-3
[55] RColorBrewer_1.1-2   tools_3.3.2          bit64_0.9-7
[58] glue_1.3.0           survival_2.42-3      AnnotationDbi_1.36.2
[61] colorspace_1.3-2     cluster_2.0.5        memoise_1.1.0
[64] knitr_1.20           bindr_0.1.1
deseq2 • 108 views
ADD COMMENTlink modified 3 months ago by James W. MacDonald50k • written 3 months ago by marcomhefti0
Answer: DESeq2 overestimates log2Fold Change
gravatar for James W. MacDonald
3 months ago by
United States
James W. MacDonald50k wrote:

The coefficients that are estimated by DESeq2 come from a GLM, which uses an interative procedure to estimate the coefficients because there is no closed-form solution. In addition, the GLM uses estimates from the library size as offsets, which you are probably not accounting for. Put a different way, you can't estimate the coefficients by hand to begin with, so by definition you can't estimate the log fold changes by hand either.

ADD COMMENTlink written 3 months ago by James W. MacDonald50k

Thanks James, I would just add that we don't even know the design as its not included above. As, James points out, the design is critical in fitting the coefficients, and only in a subset of designs is looking at ratios of average of norm. counts relevant.

ADD REPLYlink written 3 months ago by Michael Love23k

The design is


Thank you for your help Michael and James.

ADD REPLYlink written 3 months ago by marcomhefti0

Sorry - just to add clarification, Group is a factor with 3 values, vasoactives and postROSCepi are factors, and the rest are continuous.

ADD REPLYlink written 3 months ago by marcomhefti0

Right, so the LFC is not an average over an average for this design, even if you were doing simple linear regression on log counts.

ADD REPLYlink written 3 months ago by Michael Love23k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 312 users visited in the last hour