Question: Question about outlier detection for continuous phenotype using DESeq2
0
4 months ago by
xuren21200 wrote:

Hi everyone,

I have several questions about outlier detection for continuous phenotype using DESeq2. When analyzing continuous phenotype, I got weird results. Below is a reproducible example.

Suppose the phenotype of interest is a continuous variable. However, the continuous variable only takes integer values. e.g. the pheno1 in the following short example.

library(DESeq2)

set.seed(1)
countData = matrix(rnbinom(30*20, mu=1000, size=0.5), ncol=20)
mode(countData) = "integer"
pheno1 = c(rep(3,3),rep(5,5),rep(2,3),rep(10,4),rep(15,5))
##pheno1
## 3  3  3  5  5  5  5  5  2  2  2 10 10 10 10 15 15 15 15 15

colData = DataFrame(pheno = pheno1)
dds = DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ pheno)
dds = DESeq(dds)
mcols(dds)



the maxCooks column in mcols(dds) is not NA, meaning that the outlier detection will be performed in results(dds,name="pheno").

However, if we add a very small random number to the pheno1, e.g pheno2 in the following code

set.seed(1)
pheno2 = pheno1 + rnorm(20,0,0.0001)
pheno2
> pheno2
[1]  2.999937  3.000018  2.999916  5.000160  5.000033  4.999918  5.000049
[8]  5.000074  2.000058  1.999969  2.000151 10.000039  9.999938  9.999779
[15] 10.000112 14.999996 14.999998 15.000094 15.000082 15.000059
colData2 = DataFrame(pheno = pheno2)
dds2 = DESeqDataSetFromMatrix(countData = countData, colData = colData2, design = ~ pheno)
dds2 = DESeq(dds2)
mcols(dds2)


the maxCooks column in mcols(dds2) is ALL NA, meaning that no outlier will be detected.

In this example, results(dds,name="pheno") and results(dds2,name="pheno") produce almost the same result. But in real data, they could be different, especially for the pvalue column, due to the outlier detection.

My question is: (1) For the two examples above, the difference between pheno1 and pheno2 is very tiny. But when calling results function, one performs outlier detection, the other one doesn't. Is this make sense? (2) For the continuous phenotype, should the outlier be considered when using DESeq2?

Xu Ren

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] DESeq2_1.18.1              SummarizedExperiment_1.8.1
[3] DelayedArray_0.4.1         matrixStats_0.54.0
[5] Biobase_2.38.0             GenomicRanges_1.30.3
[7] GenomeInfoDb_1.14.0        IRanges_2.12.0
[9] S4Vectors_0.16.0           BiocGenerics_0.24.0

loaded via a namespace (and not attached):
[1] bit64_0.9-7            splines_3.4.0          Formula_1.2-3
[4] assertthat_0.2.0       latticeExtra_0.6-28    blob_1.1.1
[7] GenomeInfoDbData_1.0.0 pillar_1.3.1           RSQLite_2.1.1
[10] backports_1.1.3        lattice_0.20-35        glue_1.3.0
[13] digest_0.6.18          RColorBrewer_1.1-2     XVector_0.18.0
[16] checkmate_1.9.1        colorspace_1.4-0       htmltools_0.3.6
[19] Matrix_1.2-9           plyr_1.8.4             XML_3.98-1.17
[22] pkgconfig_2.0.2        genefilter_1.60.0      zlibbioc_1.24.0
[25] purrr_0.3.0            xtable_1.8-3           scales_1.0.0
[28] BiocParallel_1.12.0    htmlTable_1.13.1       tibble_2.0.1
[31] annotate_1.56.2        ggplot2_3.1.0          nnet_7.3-12
[34] lazyeval_0.2.1         survival_2.41-3        magrittr_1.5
[37] crayon_1.3.4           memoise_1.1.0          foreign_0.8-67
[40] tools_3.4.0            data.table_1.12.0      stringr_1.4.0
[43] locfit_1.5-9.1         munsell_0.5.0          cluster_2.0.6
[46] AnnotationDbi_1.40.0   bindrcpp_0.2.2         compiler_3.4.0
[49] rlang_0.3.1            grid_3.4.0             RCurl_1.95-4.11
[52] rstudioapi_0.9.0       htmlwidgets_1.3        bitops_1.0-6
[55] base64enc_0.1-3        gtable_0.2.0           DBI_1.0.0
[58] R6_2.3.0               gridExtra_2.3          knitr_1.21
[61] dplyr_0.7.8            bit_1.1-14             bindr_0.1.1
[64] Hmisc_4.2-0            stringi_1.3.1          Rcpp_1.0.0
[67] geneplotter_1.56.0     rpart_4.1-11           acepack_1.4.1
[70] tidyselect_0.2.5       xfun_0.4


deseq2 • 83 views
modified 4 months ago by Michael Love25k • written 4 months ago by xuren21200
0
4 months ago by
Michael Love25k
United States
Michael Love25k wrote:

The outlier procedure works only for blocks of similar samples which are determined by looking at the design matrix. By adding noise you make no samples similar. This is the procedure as it stands since 2014. We’ve actually worked on nonparametric methods which are far more robust against outliers, in the fishpond Bioconductor package that was just released, and extends the SAMseq method.