Question

DESeq2 overestimates log2Fold Change

0

Entering edit mode

marcomhefti • 0

@marcomhefti-19885

Last seen 7.0 years ago

I have been using DESeq2 to analyze a data set (which I can provide if you let me know how). There are three groups control (n=3), and two experimentals (n=6 and n=5 respectively). I am specifically interested in a subset of mitochondrial genes, so I extracted those genes from the output table, and looked at the fold changes. All of them are markedly higher than what I calculate by hand using average/average. I did the significance testing two ways, one was to use the built in p adjustment, and the other was to manual adjust raw pvalues for the total number of mitochondrial genes (13). I have read the vignette sections on the way DESeq2 handles fold change, but that would decrease fold change, not increase it as in this case. I also checked https://support.bioconductor.org/p/56876/ which has a similar issue, but they have genes with zero counts, whereas none of mine are zero. The pvalues adjusted for all genes and manually adjusted for mitochondrial genes only are shown, as are the log2FC calculated by DESeq2 and that calculated manually by me.

   gene_name    log2FC(DESeq)   log2FC(manual)  padj(all)   padj(mito)
1   ATP6    1.1478026   0.53888743  0.12025253  0.030016423
2   ATP8    0.1296832   -0.25716892 0.99980974  0.002090115
3   COX1    0.6343189   0.0279667   0.82381609  0.602135571
4   COX2    0.9746684   0.27872888  0.24509383  0.04293689
5   COX3    1.0013343   0.04216672  0.23750685  1
6   CYTB    1.0769144   0.14293637  0.18836461  0.012845383
7   ND1 1.0584515   0.37491184  0.20257478  0.039408231
8   ND2 1.3333982   0.74555573  0.03517281  0.137780616
9   ND3 0.8630347   0.67474684  0.45660328  0.034468894
10  ND4 1.3008298   1.24767717  0.04523489  0.002897503
11  ND4L    1.0080011   0.69944278  0.22110758  0.005564325
12  ND5 1.2066828   0.99380552  0.06801963  0.000303793
13  ND6 1.3248817   0.87680785  0.01009348  0.026041558

I am happy to provide the full data set and full code if desired but here are the relevant sections

dds <- DESeqDataSetFromMatrix(countData = cts,colData = Groups,design = formula)
dds2<-DESeq(dds,minReplicatesForReplace=Inf)
ddsp<-estimateSizeFactors(dds)
ddspp<-counts(ddsp,normalized=T)

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] ggplot2_3.0.0              gridExtra_2.3
 [3] stringr_1.3.1              reshape2_1.4.3
 [5] DESeq2_1.14.1              SummarizedExperiment_1.4.0
 [7] Biobase_2.34.0             GenomicRanges_1.26.4
 [9] GenomeInfoDb_1.10.3        IRanges_2.8.2
[11] S4Vectors_0.12.2           BiocGenerics_0.20.0

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.1       Rcpp_0.12.15         lattice_0.20-34
 [4] assertthat_0.2.0     digest_0.6.15        R6_2.3.0
 [7] plyr_1.8.4           backports_1.1.2      acepack_1.4.1
[10] RSQLite_2.1.1        pillar_1.3.0         zlibbioc_1.20.0
[13] rlang_0.2.0          lazyeval_0.2.1       rstudioapi_0.8
[16] data.table_1.11.8    annotate_1.52.1      blob_1.1.1
[19] rpart_4.1-10         Matrix_1.2-7.1       checkmate_1.8.5
[22] splines_3.3.2        BiocParallel_1.8.2   geneplotter_1.52.0
[25] foreign_0.8-71       htmlwidgets_1.3      bit_1.1-14
[28] RCurl_1.95-4.11      munsell_0.5.0        pkgconfig_2.0.2
[31] base64enc_0.1-3      tcltk_3.3.2          htmltools_0.3.6
[34] nnet_7.3-12          tibble_1.4.2         htmlTable_1.12
[37] Hmisc_4.1-1          XML_3.98-1.16        withr_2.1.2
[40] crayon_1.3.4         dplyr_0.7.4          bitops_1.0-6
[43] grid_3.3.2           xtable_1.8-3         gtable_0.2.0
[46] DBI_1.0.0            magrittr_1.5         scales_1.0.0
[49] stringi_1.2.4        XVector_0.14.1       genefilter_1.56.0
[52] bindrcpp_0.2         latticeExtra_0.6-28  Formula_1.2-3
[55] RColorBrewer_1.1-2   tools_3.3.2          bit64_0.9-7
[58] glue_1.3.0           survival_2.42-3      AnnotationDbi_1.36.2
[61] colorspace_1.3-2     cluster_2.0.5        memoise_1.1.0
[64] knitr_1.20           bindr_0.1.1

deseq2 • 966 views

ADD COMMENT • link updated 7.0 years ago by James W. MacDonald 68k • written 7.0 years ago by marcomhefti • 0

score 0 · Answer 1 · 2019-02-15

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 6 minutes ago

United States

The coefficients that are estimated by DESeq2 come from a GLM, which uses an interative procedure to estimate the coefficients because there is no closed-form solution. In addition, the GLM uses estimates from the library size as offsets, which you are probably not accounting for. Put a different way, you can't estimate the coefficients by hand to begin with, so by definition you can't estimate the log fold changes by hand either.

ADD COMMENT • link 7.0 years ago James W. MacDonald 68k

0

Entering edit mode

Thanks James, I would just add that we don't even know the design as its not included above. As, James points out, the design is critical in fitting the coefficients, and only in a subset of designs is looking at ratios of average of norm. counts relevant.

ADD REPLY • link 7.0 years ago Michael Love 43k

0

Entering edit mode

The design is

~Group+vasoactives+postROSCepi+ROSChr+ROSCmap+ROSCetco2+ROSCsat+ROSC_periperbl

Thank you for your help Michael and James.

ADD REPLY • link 7.0 years ago marcomhefti • 0

0

Entering edit mode

Sorry - just to add clarification, Group is a factor with 3 values, vasoactives and postROSCepi are factors, and the rest are continuous.

ADD REPLY • link 7.0 years ago marcomhefti • 0

0

Entering edit mode

Right, so the LFC is not an average over an average for this design, even if you were doing simple linear regression on log counts.

ADD REPLY • link 7.0 years ago Michael Love 43k