Question about outlier detection for continuous phenotype using DESeq2
1
0
Entering edit mode
xuren2120 • 0
@xuren2120-15845
Last seen 5.0 years ago

Hi everyone,

I have several questions about outlier detection for continuous phenotype using DESeq2. When analyzing continuous phenotype, I got weird results. Below is a reproducible example.

Suppose the phenotype of interest is a continuous variable. However, the continuous variable only takes integer values. e.g. the pheno1 in the following short example.

library(DESeq2)

set.seed(1)
countData = matrix(rnbinom(30*20, mu=1000, size=0.5), ncol=20)
mode(countData) = "integer"
pheno1 = c(rep(3,3),rep(5,5),rep(2,3),rep(10,4),rep(15,5))
##pheno1
## 3  3  3  5  5  5  5  5  2  2  2 10 10 10 10 15 15 15 15 15

colData = DataFrame(pheno = pheno1)
dds = DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ pheno)    
dds = DESeq(dds)
mcols(dds)

the maxCooks column in mcols(dds) is not NA, meaning that the outlier detection will be performed in results(dds,name="pheno").

However, if we add a very small random number to the pheno1, e.g pheno2 in the following code

set.seed(1)
pheno2 = pheno1 + rnorm(20,0,0.0001)
pheno2
> pheno2
 [1]  2.999937  3.000018  2.999916  5.000160  5.000033  4.999918  5.000049
 [8]  5.000074  2.000058  1.999969  2.000151 10.000039  9.999938  9.999779
[15] 10.000112 14.999996 14.999998 15.000094 15.000082 15.000059
colData2 = DataFrame(pheno = pheno2)
dds2 = DESeqDataSetFromMatrix(countData = countData, colData = colData2, design = ~ pheno)    
dds2 = DESeq(dds2)
mcols(dds2)

the maxCooks column in mcols(dds2) is ALL NA, meaning that no outlier will be detected.

In this example, results(dds,name="pheno") and results(dds2,name="pheno") produce almost the same result. But in real data, they could be different, especially for the pvalue column, due to the outlier detection.

My question is: (1) For the two examples above, the difference between pheno1 and pheno2 is very tiny. But when calling results function, one performs outlier detection, the other one doesn't. Is this make sense? (2) For the continuous phenotype, should the outlier be considered when using DESeq2?

Thanks in advance.

Xu Ren

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] DESeq2_1.18.1              SummarizedExperiment_1.8.1
 [3] DelayedArray_0.4.1         matrixStats_0.54.0
 [5] Biobase_2.38.0             GenomicRanges_1.30.3
 [7] GenomeInfoDb_1.14.0        IRanges_2.12.0
 [9] S4Vectors_0.16.0           BiocGenerics_0.24.0

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.4.0          Formula_1.2-3
 [4] assertthat_0.2.0       latticeExtra_0.6-28    blob_1.1.1
 [7] GenomeInfoDbData_1.0.0 pillar_1.3.1           RSQLite_2.1.1
[10] backports_1.1.3        lattice_0.20-35        glue_1.3.0
[13] digest_0.6.18          RColorBrewer_1.1-2     XVector_0.18.0
[16] checkmate_1.9.1        colorspace_1.4-0       htmltools_0.3.6
[19] Matrix_1.2-9           plyr_1.8.4             XML_3.98-1.17
[22] pkgconfig_2.0.2        genefilter_1.60.0      zlibbioc_1.24.0
[25] purrr_0.3.0            xtable_1.8-3           scales_1.0.0
[28] BiocParallel_1.12.0    htmlTable_1.13.1       tibble_2.0.1
[31] annotate_1.56.2        ggplot2_3.1.0          nnet_7.3-12
[34] lazyeval_0.2.1         survival_2.41-3        magrittr_1.5
[37] crayon_1.3.4           memoise_1.1.0          foreign_0.8-67
[40] tools_3.4.0            data.table_1.12.0      stringr_1.4.0
[43] locfit_1.5-9.1         munsell_0.5.0          cluster_2.0.6
[46] AnnotationDbi_1.40.0   bindrcpp_0.2.2         compiler_3.4.0
[49] rlang_0.3.1            grid_3.4.0             RCurl_1.95-4.11
[52] rstudioapi_0.9.0       htmlwidgets_1.3        bitops_1.0-6
[55] base64enc_0.1-3        gtable_0.2.0           DBI_1.0.0
[58] R6_2.3.0               gridExtra_2.3          knitr_1.21
[61] dplyr_0.7.8            bit_1.1-14             bindr_0.1.1
[64] Hmisc_4.2-0            stringi_1.3.1          Rcpp_1.0.0
[67] geneplotter_1.56.0     rpart_4.1-11           acepack_1.4.1
[70] tidyselect_0.2.5       xfun_0.4

deseq2 • 419 views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 4 hours ago
United States

The outlier procedure works only for blocks of similar samples which are determined by looking at the design matrix. By adding noise you make no samples similar. This is the procedure as it stands since 2014. We’ve actually worked on nonparametric methods which are far more robust against outliers, in the fishpond Bioconductor package that was just released, and extends the SAMseq method.

ADD COMMENT

Login before adding your answer.

Traffic: 738 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6