Question: DESeq2 count outlier replacement from Cook's threshold
0
7 months ago by
lefeverde10
lefeverde10 wrote:

I'm currently having some issues with how DESeq2 implements count outlier replacement when the number of samples suffices. From what I understand, the default is to use:

99% quantile of the F(p,m-p)

My specific questions are:

1. What is the rational behind the default Cook's distance threshold for TMM replacement?
2. What is the risk of lowering the threshold?

The reason I ask is that I'm wondering if the default setting is too stringent for my dataset. I have ~200 samples with a minimum of 12 samples per group. There's one particular gene which I expect to be positively associated with one of my groups. In my data, I see it as significantly differentially regulated padj < .05, but it shows the opposite expected trend (down-regulated instead of up-regulated). I traced this back to a single count outlier in my control group. This can been seen in the ![attached plot](https://pitt-my.sharepoint.com/:b:/g/personal/del53_pitt_edu/EYoKB0_KHFJNk6ePSaAh-pABvz6Asu92nFrJ86fzukA6eQ?e=LJvYv3) , which shows box-plots of the raw and transformed counts. The Cook's distance for this particular outlier is ~4 which is below the threshold for replacement.

The code chunk below simulates the issues I described. Here, gene1 has a single outlier in the A group which is below the cooks and is therefore not replaced.

require(DESeq2)
require(dplyr)

set.seed(42)
n_genes <- 500
m_samples <- 60

# Creating DDS with 3 groups
dds <- makeExampleDESeqDataSet(n=n_genes, m = m_samples)
colData(dds) <- DataFrame(condition=factor(c(rep('A', m_samples/4 - 3),rep('B', m_samples/4 + 3),rep('C', m_samples/2))),row.names = colnames(dds))
mcols(dds) <- NULL # not sure if neccessary

# Creates NB counts with last group having highest mean
# and first containing a single high count outlier
counts(dds)[1,] <- as.integer(c(rnbinom(m_samples/2, 5, mu=50) + rnbinom(m_samples/2, 5, mu=25), rnbinom(m_samples/2, 1, mu=150) + rnbinom(m_samples/2, 1, mu=100)))
counts(dds)[1, 1] <- as.integer(4*max(counts(dds)[1,]))
counts(dds) <- matrix(sapply(counts(dds), as.integer), dim(counts(dds)))
plotCounts(dds, gene = 'gene1')
results(DESeq(dds))['gene1',]

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] reshape2_1.4.3              ggpubr_0.2
[3] magrittr_1.5                ggplot2_3.1.0
[5] bindrcpp_0.2.2              DESeq2_1.22.1
[7] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
[9] BiocParallel_1.16.2         matrixStats_0.54.0
[11] Biobase_2.42.0              GenomicRanges_1.34.0
[13] GenomeInfoDb_1.18.1         IRanges_2.16.0
[15] S4Vectors_0.20.1            BiocGenerics_0.28.0
[17] dplyr_0.7.8

loaded via a namespace (and not attached):
[1] bitops_1.0-6           bit64_0.9-7            doParallel_1.0.14
[4] RColorBrewer_1.1-2     rprojroot_1.3-2        ggsci_2.9
[7] tools_3.5.1            backports_1.1.2        utf8_1.1.4
[10] R6_2.3.0               rpart_4.1-13           Hmisc_4.1-1
[13] DBI_1.0.0              lazyeval_0.2.1         colorspace_1.3-2
[16] nnet_7.3-12            withr_2.1.2            tidyselect_0.2.5
[19] gridExtra_2.3          bit_1.1-14             compiler_3.5.1
[22] cli_1.0.1              htmlTable_1.12         labeling_0.3
[25] scales_1.0.0           checkmate_1.8.5        genefilter_1.64.0
[28] stringr_1.3.1          digest_0.6.18          foreign_0.8-71
[31] rmarkdown_1.10         XVector_0.22.0         base64enc_0.1-3
[34] pkgconfig_2.0.2        basicOmics_0.1.0       htmltools_0.3.6
[37] limma_3.38.3           htmlwidgets_1.3        rlang_0.3.0.1
[40] rstudioapi_0.8         RSQLite_2.1.1          BiocInstaller_1.32.1
[43] bindr_0.1.1            jsonlite_1.5           acepack_1.4.1
[46] RCurl_1.95-4.11        GenomeInfoDbData_1.2.0 Formula_1.2-3
[49] Matrix_1.2-15          Rcpp_1.0.0             munsell_0.5.0
[52] fansi_0.4.0            stringi_1.2.4          yaml_2.2.0
[55] zlibbioc_1.28.0        plyr_1.8.4             grid_3.5.1
[58] blob_1.1.1             crayon_1.3.4           lattice_0.20-38
[61] cowplot_0.9.3          splines_3.5.1          annotate_1.60.0
[64] locfit_1.5-9.1         knitr_1.20             pillar_1.3.0
[67] geneplotter_1.60.0     codetools_0.2-15       XML_3.98-1.16
[70] glue_1.3.0             evaluate_0.12          latticeExtra_0.6-28
[73] data.table_1.11.8      BiocManager_1.30.4     foreach_1.4.4
[76] gtable_0.2.0           purrr_0.2.5            assertthat_0.2.0
[79] xtable_1.8-3           ggcorrplot_0.1.2       survival_2.43-3
[82] tibble_1.4.2           iterators_1.0.10       AnnotationDbi_1.44.0
[85] memoise_1.1.0          cluster_2.0.7-1

modified 7 months ago by Michael Love24k • written 7 months ago by lefeverde10
Answer: DESeq2 count outlier replacement from Cook's threshold
1
7 months ago by
Michael Love24k
United States
Michael Love24k wrote:

The motivation of that value for the threshold is given in the DESeq2 paper in the section that talks about outliers.  In short it means that the beta vector shouldn’t move too much with the removal of a single sample.  You can lower it globally for a particular dataset, and actually, in linear regression people often use a cutoff of 1. But we decided to use a threshold that had n, the number of samples, and p, the number of coefficients, in the formula.  The danger is that too many observations would be called outliers, which doesn’t seem like a good approach to me. We’ve been working on better approaches to the outlier problem, but in the mean time you could try the outlier downweighting approach in edgeR or limma,  if lowering the threshold to one ends up with too many observations called outliers.

Hi Michael!

Thanks for getting back to me so quickly! I went back over your paper and the vignette and my interpretation was that the TMM replacement method would err on the side of false negatives rather than positives.

This approach is conservative, it will not lead to false positives, as it replaces the outlier value with the value predicted by the null hypothesis.

This seems a reasonable trade off compared to the alternative methods described in your paper. My initial concern was that lowering the threshold would produce an unpredictable bias. However, after thinking a bit more, I suppose that any threshold I choose will replace the counts of 0 to n samples with the TMM. If I set the threshold such that all n samples have their counts replaced with the TMM, then I would obviously not reject the null. If I did this with half the samples, then I would expect the counts to be more similar than the they were initially and thus less likely to reject the null. Likewise for any number of samples' counts with the TMM.

For my own dataset, the number of genes containing a Cook's value >= 1 is around 700, and of those only about 200 had a padj < .05. So my assumption is that lowering the threshold will not drastically alter the results.

To implement TMM replacement using a lowered Cook's cutoff, I ran the standard DESeq wrapper function, followed by updating the DESeqDataSet object using the replaceOutliersWithTrimmedMean and then re-running DESeq wrapper with the updated counts. This is shown in the following code chunk:

dds <- DESeq(dds) %>%
replaceOutliersWithTrimmedMean(., cooksCutoff = 1) %>%
DESeq(.)

1

That should work.

Well hopefully have an improved method for outliers out soon next year.