I have been using DESeq2 to analyze a data set (which I can provide if you let me know how). There are three groups control (n=3), and two experimentals (n=6 and n=5 respectively). I am specifically interested in a subset of mitochondrial genes, so I extracted those genes from the output table, and looked at the fold changes. All of them are markedly higher than what I calculate by hand using average/average. I did the significance testing two ways, one was to use the built in p adjustment, and the other was to manual adjust raw pvalues for the total number of mitochondrial genes (13). I have read the vignette sections on the way DESeq2 handles fold change, but that would decrease fold change, not increase it as in this case. I also checked https://support.bioconductor.org/p/56876/ which has a similar issue, but they have genes with zero counts, whereas none of mine are zero. The pvalues adjusted for all genes and manually adjusted for mitochondrial genes only are shown, as are the log2FC calculated by DESeq2 and that calculated manually by me.
gene_name log2FC(DESeq) log2FC(manual) padj(all) padj(mito)
1 ATP6 1.1478026 0.53888743 0.12025253 0.030016423
2 ATP8 0.1296832 -0.25716892 0.99980974 0.002090115
3 COX1 0.6343189 0.0279667 0.82381609 0.602135571
4 COX2 0.9746684 0.27872888 0.24509383 0.04293689
5 COX3 1.0013343 0.04216672 0.23750685 1
6 CYTB 1.0769144 0.14293637 0.18836461 0.012845383
7 ND1 1.0584515 0.37491184 0.20257478 0.039408231
8 ND2 1.3333982 0.74555573 0.03517281 0.137780616
9 ND3 0.8630347 0.67474684 0.45660328 0.034468894
10 ND4 1.3008298 1.24767717 0.04523489 0.002897503
11 ND4L 1.0080011 0.69944278 0.22110758 0.005564325
12 ND5 1.2066828 0.99380552 0.06801963 0.000303793
13 ND6 1.3248817 0.87680785 0.01009348 0.026041558
I am happy to provide the full data set and full code if desired but here are the relevant sections
dds <- DESeqDataSetFromMatrix(countData = cts,colData = Groups,design = formula)
dds2<-DESeq(dds,minReplicatesForReplace=Inf)
ddsp<-estimateSizeFactors(dds)
ddspp<-counts(ddsp,normalized=T)
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] ggplot2_3.0.0 gridExtra_2.3
[3] stringr_1.3.1 reshape2_1.4.3
[5] DESeq2_1.14.1 SummarizedExperiment_1.4.0
[7] Biobase_2.34.0 GenomicRanges_1.26.4
[9] GenomeInfoDb_1.10.3 IRanges_2.8.2
[11] S4Vectors_0.12.2 BiocGenerics_0.20.0
loaded via a namespace (and not attached):
[1] locfit_1.5-9.1 Rcpp_0.12.15 lattice_0.20-34
[4] assertthat_0.2.0 digest_0.6.15 R6_2.3.0
[7] plyr_1.8.4 backports_1.1.2 acepack_1.4.1
[10] RSQLite_2.1.1 pillar_1.3.0 zlibbioc_1.20.0
[13] rlang_0.2.0 lazyeval_0.2.1 rstudioapi_0.8
[16] data.table_1.11.8 annotate_1.52.1 blob_1.1.1
[19] rpart_4.1-10 Matrix_1.2-7.1 checkmate_1.8.5
[22] splines_3.3.2 BiocParallel_1.8.2 geneplotter_1.52.0
[25] foreign_0.8-71 htmlwidgets_1.3 bit_1.1-14
[28] RCurl_1.95-4.11 munsell_0.5.0 pkgconfig_2.0.2
[31] base64enc_0.1-3 tcltk_3.3.2 htmltools_0.3.6
[34] nnet_7.3-12 tibble_1.4.2 htmlTable_1.12
[37] Hmisc_4.1-1 XML_3.98-1.16 withr_2.1.2
[40] crayon_1.3.4 dplyr_0.7.4 bitops_1.0-6
[43] grid_3.3.2 xtable_1.8-3 gtable_0.2.0
[46] DBI_1.0.0 magrittr_1.5 scales_1.0.0
[49] stringi_1.2.4 XVector_0.14.1 genefilter_1.56.0
[52] bindrcpp_0.2 latticeExtra_0.6-28 Formula_1.2-3
[55] RColorBrewer_1.1-2 tools_3.3.2 bit64_0.9-7
[58] glue_1.3.0 survival_2.42-3 AnnotationDbi_1.36.2
[61] colorspace_1.3-2 cluster_2.0.5 memoise_1.1.0
[64] knitr_1.20 bindr_0.1.1
Thanks James, I would just add that we don't even know the design as its not included above. As, James points out, the design is critical in fitting the coefficients, and only in a subset of designs is looking at ratios of average of norm. counts relevant.
The design is
~Group+vasoactives+postROSCepi+ROSChr+ROSCmap+ROSCetco2+ROSCsat+ROSC_periperbl
Thank you for your help Michael and James.
Sorry - just to add clarification, Group is a factor with 3 values, vasoactives and postROSCepi are factors, and the rest are continuous.
Right, so the LFC is not an average over an average for this design, even if you were doing simple linear regression on log counts.