I'm currently having some issues with how
DESeq2 implements count outlier replacement when the number of samples suffices. From what I understand, the default is to use:
99% quantile of the F(p,m-p)
My specific questions are:
- What is the rational behind the default Cook's distance threshold for TMM replacement?
- What is the risk of lowering the threshold?
The reason I ask is that I'm wondering if the default setting is too stringent for my dataset. I have ~200 samples with a minimum of 12 samples per group. There's one particular gene which I expect to be positively associated with one of my groups. In my data, I see it as significantly differentially regulated
padj < .05, but it shows the opposite expected trend (down-regulated instead of up-regulated). I traced this back to a single count outlier in my control group. This can been seen in the ![attached plot](https://pitt-my.sharepoint.com/:b:/g/personal/del53_pitt_edu/EYoKB0_KHFJNk6ePSaAh-pABvz6Asu92nFrJ86fzukA6eQ?e=LJvYv3) , which shows box-plots of the raw and transformed counts. The Cook's distance for this particular outlier is ~4 which is below the threshold for replacement.
The code chunk below simulates the issues I described. Here,
gene1 has a single outlier in the
A group which is below the cooks and is therefore not replaced.
require(DESeq2) require(dplyr) set.seed(42) n_genes <- 500 m_samples <- 60 # Creating DDS with 3 groups dds <- makeExampleDESeqDataSet(n=n_genes, m = m_samples) colData(dds) <- DataFrame(condition=factor(c(rep('A', m_samples/4 - 3),rep('B', m_samples/4 + 3),rep('C', m_samples/2))),row.names = colnames(dds)) mcols(dds) <- NULL # not sure if neccessary # Creates NB counts with last group having highest mean # and first containing a single high count outlier counts(dds)[1,] <- as.integer(c(rnbinom(m_samples/2, 5, mu=50) + rnbinom(m_samples/2, 5, mu=25), rnbinom(m_samples/2, 1, mu=150) + rnbinom(m_samples/2, 1, mu=100))) counts(dds)[1, 1] <- as.integer(4*max(counts(dds)[1,])) counts(dds) <- matrix(sapply(counts(dds), as.integer), dim(counts(dds))) plotCounts(dds, gene = 'gene1') results(DESeq(dds))['gene1',]
R version 3.5.1 (2018-07-02) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS High Sierra 10.13.6 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib locale:  en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages:  parallel stats4 stats graphics grDevices utils datasets  methods base other attached packages:  reshape2_1.4.3 ggpubr_0.2  magrittr_1.5 ggplot2_3.1.0  bindrcpp_0.2.2 DESeq2_1.22.1  SummarizedExperiment_1.12.0 DelayedArray_0.8.0  BiocParallel_1.16.2 matrixStats_0.54.0  Biobase_2.42.0 GenomicRanges_1.34.0  GenomeInfoDb_1.18.1 IRanges_2.16.0  S4Vectors_0.20.1 BiocGenerics_0.28.0  dplyr_0.7.8 loaded via a namespace (and not attached):  bitops_1.0-6 bit64_0.9-7 doParallel_1.0.14  RColorBrewer_1.1-2 rprojroot_1.3-2 ggsci_2.9  tools_3.5.1 backports_1.1.2 utf8_1.1.4  R6_2.3.0 rpart_4.1-13 Hmisc_4.1-1  DBI_1.0.0 lazyeval_0.2.1 colorspace_1.3-2  nnet_7.3-12 withr_2.1.2 tidyselect_0.2.5  gridExtra_2.3 bit_1.1-14 compiler_3.5.1  cli_1.0.1 htmlTable_1.12 labeling_0.3  scales_1.0.0 checkmate_1.8.5 genefilter_1.64.0  stringr_1.3.1 digest_0.6.18 foreign_0.8-71  rmarkdown_1.10 XVector_0.22.0 base64enc_0.1-3  pkgconfig_2.0.2 basicOmics_0.1.0 htmltools_0.3.6  limma_3.38.3 htmlwidgets_1.3 rlang_0.3.0.1  rstudioapi_0.8 RSQLite_2.1.1 BiocInstaller_1.32.1  bindr_0.1.1 jsonlite_1.5 acepack_1.4.1  RCurl_1.95-4.11 GenomeInfoDbData_1.2.0 Formula_1.2-3  Matrix_1.2-15 Rcpp_1.0.0 munsell_0.5.0  fansi_0.4.0 stringi_1.2.4 yaml_2.2.0  zlibbioc_1.28.0 plyr_1.8.4 grid_3.5.1  blob_1.1.1 crayon_1.3.4 lattice_0.20-38  cowplot_0.9.3 splines_3.5.1 annotate_1.60.0  locfit_1.5-9.1 knitr_1.20 pillar_1.3.0  geneplotter_1.60.0 codetools_0.2-15 XML_3.98-1.16  glue_1.3.0 evaluate_0.12 latticeExtra_0.6-28  data.table_1.11.8 BiocManager_1.30.4 foreach_1.4.4  gtable_0.2.0 purrr_0.2.5 assertthat_0.2.0  xtable_1.8-3 ggcorrplot_0.1.2 survival_2.43-3  tibble_1.4.2 iterators_1.0.10 AnnotationDbi_1.44.0  memoise_1.1.0 cluster_2.0.7-1