DESeq2 overestimates log2Fold Change
1
0
Entering edit mode
@marcomhefti-19885
Last seen 5.1 years ago

I have been using DESeq2 to analyze a data set (which I can provide if you let me know how). There are three groups control (n=3), and two experimentals (n=6 and n=5 respectively). I am specifically interested in a subset of mitochondrial genes, so I extracted those genes from the output table, and looked at the fold changes. All of them are markedly higher than what I calculate by hand using average/average. I did the significance testing two ways, one was to use the built in p adjustment, and the other was to manual adjust raw pvalues for the total number of mitochondrial genes (13). I have read the vignette sections on the way DESeq2 handles fold change, but that would decrease fold change, not increase it as in this case. I also checked https://support.bioconductor.org/p/56876/ which has a similar issue, but they have genes with zero counts, whereas none of mine are zero. The pvalues adjusted for all genes and manually adjusted for mitochondrial genes only are shown, as are the log2FC calculated by DESeq2 and that calculated manually by me.

   gene_name    log2FC(DESeq)   log2FC(manual)  padj(all)   padj(mito)
1   ATP6    1.1478026   0.53888743  0.12025253  0.030016423
2   ATP8    0.1296832   -0.25716892 0.99980974  0.002090115
3   COX1    0.6343189   0.0279667   0.82381609  0.602135571
4   COX2    0.9746684   0.27872888  0.24509383  0.04293689
5   COX3    1.0013343   0.04216672  0.23750685  1
6   CYTB    1.0769144   0.14293637  0.18836461  0.012845383
7   ND1 1.0584515   0.37491184  0.20257478  0.039408231
8   ND2 1.3333982   0.74555573  0.03517281  0.137780616
9   ND3 0.8630347   0.67474684  0.45660328  0.034468894
10  ND4 1.3008298   1.24767717  0.04523489  0.002897503
11  ND4L    1.0080011   0.69944278  0.22110758  0.005564325
12  ND5 1.2066828   0.99380552  0.06801963  0.000303793
13  ND6 1.3248817   0.87680785  0.01009348  0.026041558

I am happy to provide the full data set and full code if desired but here are the relevant sections

dds <- DESeqDataSetFromMatrix(countData = cts,colData = Groups,design = formula)
dds2<-DESeq(dds,minReplicatesForReplace=Inf)
ddsp<-estimateSizeFactors(dds)
ddspp<-counts(ddsp,normalized=T)

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] ggplot2_3.0.0              gridExtra_2.3
 [3] stringr_1.3.1              reshape2_1.4.3
 [5] DESeq2_1.14.1              SummarizedExperiment_1.4.0
 [7] Biobase_2.34.0             GenomicRanges_1.26.4
 [9] GenomeInfoDb_1.10.3        IRanges_2.8.2
[11] S4Vectors_0.12.2           BiocGenerics_0.20.0

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.1       Rcpp_0.12.15         lattice_0.20-34
 [4] assertthat_0.2.0     digest_0.6.15        R6_2.3.0
 [7] plyr_1.8.4           backports_1.1.2      acepack_1.4.1
[10] RSQLite_2.1.1        pillar_1.3.0         zlibbioc_1.20.0
[13] rlang_0.2.0          lazyeval_0.2.1       rstudioapi_0.8
[16] data.table_1.11.8    annotate_1.52.1      blob_1.1.1
[19] rpart_4.1-10         Matrix_1.2-7.1       checkmate_1.8.5
[22] splines_3.3.2        BiocParallel_1.8.2   geneplotter_1.52.0
[25] foreign_0.8-71       htmlwidgets_1.3      bit_1.1-14
[28] RCurl_1.95-4.11      munsell_0.5.0        pkgconfig_2.0.2
[31] base64enc_0.1-3      tcltk_3.3.2          htmltools_0.3.6
[34] nnet_7.3-12          tibble_1.4.2         htmlTable_1.12
[37] Hmisc_4.1-1          XML_3.98-1.16        withr_2.1.2
[40] crayon_1.3.4         dplyr_0.7.4          bitops_1.0-6
[43] grid_3.3.2           xtable_1.8-3         gtable_0.2.0
[46] DBI_1.0.0            magrittr_1.5         scales_1.0.0
[49] stringi_1.2.4        XVector_0.14.1       genefilter_1.56.0
[52] bindrcpp_0.2         latticeExtra_0.6-28  Formula_1.2-3
[55] RColorBrewer_1.1-2   tools_3.3.2          bit64_0.9-7
[58] glue_1.3.0           survival_2.42-3      AnnotationDbi_1.36.2
[61] colorspace_1.3-2     cluster_2.0.5        memoise_1.1.0
[64] knitr_1.20           bindr_0.1.1
deseq2 • 530 views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 9 hours ago
United States

The coefficients that are estimated by DESeq2 come from a GLM, which uses an interative procedure to estimate the coefficients because there is no closed-form solution. In addition, the GLM uses estimates from the library size as offsets, which you are probably not accounting for. Put a different way, you can't estimate the coefficients by hand to begin with, so by definition you can't estimate the log fold changes by hand either.

ADD COMMENT
0
Entering edit mode

Thanks James, I would just add that we don't even know the design as its not included above. As, James points out, the design is critical in fitting the coefficients, and only in a subset of designs is looking at ratios of average of norm. counts relevant.

ADD REPLY
0
Entering edit mode

The design is

~Group+vasoactives+postROSCepi+ROSChr+ROSCmap+ROSCetco2+ROSCsat+ROSC_periperbl

Thank you for your help Michael and James.

ADD REPLY
0
Entering edit mode

Sorry - just to add clarification, Group is a factor with 3 values, vasoactives and postROSCepi are factors, and the rest are continuous.

ADD REPLY
0
Entering edit mode

Right, so the LFC is not an average over an average for this design, even if you were doing simple linear regression on log counts.

ADD REPLY

Login before adding your answer.

Traffic: 972 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6