Question: Question about outlier detection for continuous phenotype using DESeq2
gravatar for xuren2120
4 months ago by
xuren21200 wrote:

Hi everyone,

I have several questions about outlier detection for continuous phenotype using DESeq2. When analyzing continuous phenotype, I got weird results. Below is a reproducible example.

Suppose the phenotype of interest is a continuous variable. However, the continuous variable only takes integer values. e.g. the pheno1 in the following short example.


countData = matrix(rnbinom(30*20, mu=1000, size=0.5), ncol=20)
mode(countData) = "integer"
pheno1 = c(rep(3,3),rep(5,5),rep(2,3),rep(10,4),rep(15,5))
## 3  3  3  5  5  5  5  5  2  2  2 10 10 10 10 15 15 15 15 15

colData = DataFrame(pheno = pheno1)
dds = DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ pheno)    
dds = DESeq(dds)

the maxCooks column in mcols(dds) is not NA, meaning that the outlier detection will be performed in results(dds,name="pheno").

However, if we add a very small random number to the pheno1, e.g pheno2 in the following code

pheno2 = pheno1 + rnorm(20,0,0.0001)
> pheno2
 [1]  2.999937  3.000018  2.999916  5.000160  5.000033  4.999918  5.000049
 [8]  5.000074  2.000058  1.999969  2.000151 10.000039  9.999938  9.999779
[15] 10.000112 14.999996 14.999998 15.000094 15.000082 15.000059
colData2 = DataFrame(pheno = pheno2)
dds2 = DESeqDataSetFromMatrix(countData = countData, colData = colData2, design = ~ pheno)    
dds2 = DESeq(dds2)

the maxCooks column in mcols(dds2) is ALL NA, meaning that no outlier will be detected.

In this example, results(dds,name="pheno") and results(dds2,name="pheno") produce almost the same result. But in real data, they could be different, especially for the pvalue column, due to the outlier detection.

My question is: (1) For the two examples above, the difference between pheno1 and pheno2 is very tiny. But when calling results function, one performs outlier detection, the other one doesn't. Is this make sense? (2) For the continuous phenotype, should the outlier be considered when using DESeq2?

Thanks in advance.

Xu Ren

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] DESeq2_1.18.1              SummarizedExperiment_1.8.1
 [3] DelayedArray_0.4.1         matrixStats_0.54.0
 [5] Biobase_2.38.0             GenomicRanges_1.30.3
 [7] GenomeInfoDb_1.14.0        IRanges_2.12.0
 [9] S4Vectors_0.16.0           BiocGenerics_0.24.0

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.4.0          Formula_1.2-3
 [4] assertthat_0.2.0       latticeExtra_0.6-28    blob_1.1.1
 [7] GenomeInfoDbData_1.0.0 pillar_1.3.1           RSQLite_2.1.1
[10] backports_1.1.3        lattice_0.20-35        glue_1.3.0
[13] digest_0.6.18          RColorBrewer_1.1-2     XVector_0.18.0
[16] checkmate_1.9.1        colorspace_1.4-0       htmltools_0.3.6
[19] Matrix_1.2-9           plyr_1.8.4             XML_3.98-1.17
[22] pkgconfig_2.0.2        genefilter_1.60.0      zlibbioc_1.24.0
[25] purrr_0.3.0            xtable_1.8-3           scales_1.0.0
[28] BiocParallel_1.12.0    htmlTable_1.13.1       tibble_2.0.1
[31] annotate_1.56.2        ggplot2_3.1.0          nnet_7.3-12
[34] lazyeval_0.2.1         survival_2.41-3        magrittr_1.5
[37] crayon_1.3.4           memoise_1.1.0          foreign_0.8-67
[40] tools_3.4.0            data.table_1.12.0      stringr_1.4.0
[43] locfit_1.5-9.1         munsell_0.5.0          cluster_2.0.6
[46] AnnotationDbi_1.40.0   bindrcpp_0.2.2         compiler_3.4.0
[49] rlang_0.3.1            grid_3.4.0             RCurl_1.95-4.11
[52] rstudioapi_0.9.0       htmlwidgets_1.3        bitops_1.0-6
[55] base64enc_0.1-3        gtable_0.2.0           DBI_1.0.0
[58] R6_2.3.0               gridExtra_2.3          knitr_1.21
[61] dplyr_0.7.8            bit_1.1-14             bindr_0.1.1
[64] Hmisc_4.2-0            stringi_1.3.1          Rcpp_1.0.0
[67] geneplotter_1.56.0     rpart_4.1-11           acepack_1.4.1
[70] tidyselect_0.2.5       xfun_0.4

deseq2 • 83 views
ADD COMMENTlink modified 4 months ago by Michael Love25k • written 4 months ago by xuren21200
Answer: Question about outlier detection for continuous phenotype using DESeq2
gravatar for Michael Love
4 months ago by
Michael Love25k
United States
Michael Love25k wrote:

The outlier procedure works only for blocks of similar samples which are determined by looking at the design matrix. By adding noise you make no samples similar. This is the procedure as it stands since 2014. We’ve actually worked on nonparametric methods which are far more robust against outliers, in the fishpond Bioconductor package that was just released, and extends the SAMseq method.

ADD COMMENTlink written 4 months ago by Michael Love25k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 285 users visited in the last hour