Question: DESeq2 overestimates log2Fold Change
0
4 weeks ago by
marcomhefti0 wrote:

I have been using DESeq2 to analyze a data set (which I can provide if you let me know how). There are three groups control (n=3), and two experimentals (n=6 and n=5 respectively). I am specifically interested in a subset of mitochondrial genes, so I extracted those genes from the output table, and looked at the fold changes. All of them are markedly higher than what I calculate by hand using average/average. I did the significance testing two ways, one was to use the built in p adjustment, and the other was to manual adjust raw pvalues for the total number of mitochondrial genes (13). I have read the vignette sections on the way DESeq2 handles fold change, but that would decrease fold change, not increase it as in this case. I also checked https://support.bioconductor.org/p/56876/ which has a similar issue, but they have genes with zero counts, whereas none of mine are zero. The pvalues adjusted for all genes and manually adjusted for mitochondrial genes only are shown, as are the log2FC calculated by DESeq2 and that calculated manually by me.

   gene_name    log2FC(DESeq)   log2FC(manual)  padj(all)   padj(mito)
1   ATP6    1.1478026   0.53888743  0.12025253  0.030016423
2   ATP8    0.1296832   -0.25716892 0.99980974  0.002090115
3   COX1    0.6343189   0.0279667   0.82381609  0.602135571
4   COX2    0.9746684   0.27872888  0.24509383  0.04293689
5   COX3    1.0013343   0.04216672  0.23750685  1
6   CYTB    1.0769144   0.14293637  0.18836461  0.012845383
7   ND1 1.0584515   0.37491184  0.20257478  0.039408231
8   ND2 1.3333982   0.74555573  0.03517281  0.137780616
9   ND3 0.8630347   0.67474684  0.45660328  0.034468894
10  ND4 1.3008298   1.24767717  0.04523489  0.002897503
11  ND4L    1.0080011   0.69944278  0.22110758  0.005564325
12  ND5 1.2066828   0.99380552  0.06801963  0.000303793
13  ND6 1.3248817   0.87680785  0.01009348  0.026041558


I am happy to provide the full data set and full code if desired but here are the relevant sections

dds <- DESeqDataSetFromMatrix(countData = cts,colData = Groups,design = formula)
dds2<-DESeq(dds,minReplicatesForReplace=Inf)
ddsp<-estimateSizeFactors(dds)
ddspp<-counts(ddsp,normalized=T)

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] ggplot2_3.0.0              gridExtra_2.3
[3] stringr_1.3.1              reshape2_1.4.3
[5] DESeq2_1.14.1              SummarizedExperiment_1.4.0
[7] Biobase_2.34.0             GenomicRanges_1.26.4
[9] GenomeInfoDb_1.10.3        IRanges_2.8.2
[11] S4Vectors_0.12.2           BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] locfit_1.5-9.1       Rcpp_0.12.15         lattice_0.20-34
[4] assertthat_0.2.0     digest_0.6.15        R6_2.3.0
[7] plyr_1.8.4           backports_1.1.2      acepack_1.4.1
[10] RSQLite_2.1.1        pillar_1.3.0         zlibbioc_1.20.0
[13] rlang_0.2.0          lazyeval_0.2.1       rstudioapi_0.8
[16] data.table_1.11.8    annotate_1.52.1      blob_1.1.1
[19] rpart_4.1-10         Matrix_1.2-7.1       checkmate_1.8.5
[22] splines_3.3.2        BiocParallel_1.8.2   geneplotter_1.52.0
[25] foreign_0.8-71       htmlwidgets_1.3      bit_1.1-14
[28] RCurl_1.95-4.11      munsell_0.5.0        pkgconfig_2.0.2
[31] base64enc_0.1-3      tcltk_3.3.2          htmltools_0.3.6
[34] nnet_7.3-12          tibble_1.4.2         htmlTable_1.12
[37] Hmisc_4.1-1          XML_3.98-1.16        withr_2.1.2
[40] crayon_1.3.4         dplyr_0.7.4          bitops_1.0-6
[43] grid_3.3.2           xtable_1.8-3         gtable_0.2.0
[46] DBI_1.0.0            magrittr_1.5         scales_1.0.0
[49] stringi_1.2.4        XVector_0.14.1       genefilter_1.56.0
[52] bindrcpp_0.2         latticeExtra_0.6-28  Formula_1.2-3
[55] RColorBrewer_1.1-2   tools_3.3.2          bit64_0.9-7
[58] glue_1.3.0           survival_2.42-3      AnnotationDbi_1.36.2
[61] colorspace_1.3-2     cluster_2.0.5        memoise_1.1.0
[64] knitr_1.20           bindr_0.1.1

deseq2 • 79 views
modified 4 weeks ago by James W. MacDonald49k • written 4 weeks ago by marcomhefti0
0
4 weeks ago by
United States
James W. MacDonald49k wrote:

The coefficients that are estimated by DESeq2 come from a GLM, which uses an interative procedure to estimate the coefficients because there is no closed-form solution. In addition, the GLM uses estimates from the library size as offsets, which you are probably not accounting for. Put a different way, you can't estimate the coefficients by hand to begin with, so by definition you can't estimate the log fold changes by hand either.

Thanks James, I would just add that we don't even know the design as its not included above. As, James points out, the design is critical in fitting the coefficients, and only in a subset of designs is looking at ratios of average of norm. counts relevant.

The design is

~Group+vasoactives+postROSCepi+ROSChr+ROSCmap+ROSCetco2+ROSCsat+ROSC_periperbl

Thank you for your help Michael and James.

Sorry - just to add clarification, Group is a factor with 3 values, vasoactives and postROSCepi are factors, and the rest are continuous.