Hello,
I'm using DESeq2 to analyse a large dataset with basically three predictors:
1 - Subjects = the ID of the subject involved in the experiment (categorical)
2 - Culture medium = the culture medium (categorical)
3 - Time = the number of days (continuous)
I would like to inspect the proportion of variance explained by each of the three predictors for each gene, so that I can categorize genes by their most influential factor (this is only an exploratory analysis). Following the DESeq2 manual, the paper, and this post, I wrote this little script, but I'm not sure whether it is correct:
# Nested designs, from the full model down to the intercept-only model
models <- list(
  all        = "~ medium + subject + days",
  no.food    = "~ subject + days",   # full model minus the culture-medium term
  no.subject = "~ days",
  null       = "~ 1"
)

# Fit each model and extract the per-gene deviance
deviances <- lapply(models, function(m) {
  design(dds) <- as.formula(m)
  dds <- DESeq(dds, test = "Wald", fitType = "parametric")
  mcols(dds)$deviance
})

# Deviance of the intercept-only (null) model
residual <- deviances[[length(deviances)]]

# Deviance attributed to each term: difference between consecutive nested models
deviances <- lapply(seq_along(deviances)[-1], function(i) {
  n <- i - 1
  abs(deviances[[i]] - deviances[[n]])
})
deviances <- data.frame(do.call(cbind, deviances))

# Residual (unexplained) deviance = null deviance minus the explained components
deviances$residual <- residual - rowSums(deviances)

# Convert each row to proportions of the total deviance
deviances <- data.frame(t(apply(deviances, 1, function(x) x / sum(x))))
deviances <- setNames(deviances, c("medium", "subject", "days", "residual"))
I'm probably doing something wrong, since I end up with negative residuals for some genes ...
Thanks in advance!
cheers
Giovanni
sessionInfo():
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS
Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=it_IT.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=it_IT.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=it_IT.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggtern_2.2.1 vegan_2.4-6 lattice_0.20-35
[4] permute_0.9-4 ggsci_2.8 scales_0.5.0.9000
[7] bindrcpp_0.2 dplyr_0.7.4 ape_5.0
[10] ggthemes_3.4.0 DESeq2_1.18.1 SummarizedExperiment_1.8.1
[13] DelayedArray_0.4.1 matrixStats_0.53.1 GenomicRanges_1.30.2
[16] GenomeInfoDb_1.14.0 IRanges_2.12.0 S4Vectors_0.16.0
[19] phyloseq_1.23.1 metagenomeSeq_1.20.1 RColorBrewer_1.1-2
[22] glmnet_2.0-13 foreach_1.4.4 Matrix_1.2-11
[25] limma_3.34.9 Biobase_2.38.0 BiocGenerics_0.24.0
[28] gridExtra_2.3 ggplot2_2.2.1
loaded via a namespace (and not attached):
[1] nlme_3.1-131.1 bitops_1.0-6 bit64_0.9-7 latex2exp_0.4.0
[5] tensorA_0.36 tools_3.4.4 backports_1.1.2 utf8_1.1.3
[9] R6_2.2.2 rpart_4.1-13 KernSmooth_2.23-15 Hmisc_4.1-1
[13] DBI_0.7 lazyeval_0.2.1 mgcv_1.8-23 colorspace_1.3-2
[17] ade4_1.7-10 nnet_7.3-12 bayesm_3.1-0.1 bit_1.1-12
[21] compiler_3.4.4 compositions_1.40-1 cli_1.0.0 htmlTable_1.11.2
[25] labeling_0.3 caTools_1.17.1 checkmate_1.8.5 DEoptimR_1.0-8
[29] robustbase_0.92-8 genefilter_1.60.0 stringr_1.3.0 digest_0.6.15
[33] foreign_0.8-69 XVector_0.18.0 base64enc_0.1-3 pkgconfig_2.0.1
[37] htmltools_0.3.6 htmlwidgets_1.0 rlang_0.2.0 rstudioapi_0.7
[41] RSQLite_2.0 bindr_0.1 energy_1.7-2 jsonlite_1.5
[45] BiocParallel_1.12.0 gtools_3.5.0 acepack_1.4.1 RCurl_1.95-4.10
[49] magrittr_1.5 GenomeInfoDbData_1.0.0 Formula_1.2-2 biomformat_1.6.0
[53] Rcpp_0.12.15 munsell_0.4.3 proto_1.0.0 stringi_1.1.6
[57] yaml_2.1.16 MASS_7.3-49 zlibbioc_1.24.0 rhdf5_2.22.0
[61] gplots_3.0.1 plyr_1.8.4 blob_1.1.0 gdata_2.18.0
[65] crayon_1.3.4 Biostrings_2.46.0 splines_3.4.4 multtest_2.34.0
[69] annotate_1.56.1 locfit_1.5-9.1 knitr_1.20 pillar_1.1.0
[73] igraph_1.1.2 boot_1.3-20 geneplotter_1.56.0 reshape2_1.4.3
[77] codetools_0.2-15 glue_1.2.0 XML_3.98-1.10 latticeExtra_0.6-28
[81] data.table_1.10.4-3 gtable_0.2.0 assertthat_0.2.0 xtable_1.8-2
[85] survival_2.41-3 tibble_1.4.2 iterators_1.0.9 AnnotationDbi_1.40.0
[89] memoise_1.1.0 cluster_2.0.6
Sorry to come back to this after quite a long time, but I'm still struggling to produce reliable results. The calculation above produces some negative values. I would say that the factor in question is somehow worsening the model, increasing the deviance of the fuller model with respect to the reduced one, but I don't know how to deal with it. Should I treat those values as zeros, or should I leave them negative?
Thanks again!
Giovanni
I would suggest using the dispersion estimates from the full model. If you run separate models, you are re-estimating the dispersion each time. You can run estimateDispersions(dds) on the full model, and then for the reduced models set the dispersions with dispersions(dds_reduce) <- dispersions(dds_full). Then you only need to re-run nbinomWaldTest(), not the whole procedure.
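For concreteness, a minimal sketch of that workflow (dds_full and dds_reduce are placeholder names, and the reduced design ~ subject + days is just one example):

# Estimate size factors and dispersions once, under the full design
design(dds) <- ~ medium + subject + days
dds_full <- estimateSizeFactors(dds)
dds_full <- estimateDispersions(dds_full, fitType = "parametric")
dds_full <- nbinomWaldTest(dds_full)

# For a reduced design, reuse the full-model dispersions and only re-run the Wald fit
dds_reduce <- dds_full
design(dds_reduce) <- ~ subject + days
dispersions(dds_reduce) <- dispersions(dds_full)
dds_reduce <- nbinomWaldTest(dds_reduce)

# Per-gene deviances of the two fits
dev_full   <- mcols(dds_full)$deviance
dev_reduce <- mcols(dds_reduce)$deviance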
Dear Michael,
thanks for your help! I changed my code as follows:
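Roughly, what I did (a sketch with placeholder object names, reusing the full-model dispersions as you suggested):

# Full model, fitted once
design(dds) <- ~ medium + subject + days
dds_full <- DESeq(dds, test = "Wald", fitType = "parametric")

# Refit a reduced design reusing the dispersions estimated under the full model
refit <- function(f) {
  d <- dds_full
  design(d) <- f
  dispersions(d) <- dispersions(dds_full)
  nbinomWaldTest(d)
}

fits <- list(
  all        = dds_full,
  no.medium  = refit(~ subject + days),
  no.subject = refit(~ days),
  null       = refit(~ 1)
)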
This way I only need to change the design of the experiment, leaving everything else as already estimated in the full model. Then, to build the full deviance table:
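Again as a sketch, using the fits list from above:

# Per-gene deviance for each fit (genes x models matrix)
dev <- sapply(fits, function(d) mcols(d)$deviance)

# Deviance attributed to each term (consecutive differences between nested models),
# plus the residual deviance of the full model
deviances <- data.frame(
  medium   = dev[, "no.medium"]  - dev[, "all"],
  subject  = dev[, "no.subject"] - dev[, "no.medium"],
  days     = dev[, "null"]       - dev[, "no.subject"],
  residual = dev[, "all"]
)

# Each row then sums to the null-model deviance; convert to proportions
deviances <- sweep(deviances, 1, rowSums(deviances), "/")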
This seems to work: now I only get positive values, and they look reasonable. Does this sound right to you?
Thanks again,
ciao,
Giovanni
I wouldn't use the beta prior if you're calculating deviances and comparing them across models. The LRT theory does not apply to biased estimates of coefficients (the beta prior induces a bias).
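For example (a sketch with placeholder object names; whether the prior is on by default depends on the DESeq2 version, so it is safest to set it explicitly):

# Fit without the zero-centered prior on the coefficients
dds_full <- DESeq(dds, test = "Wald", fitType = "parametric", betaPrior = FALSE)

# Likewise when re-running only the Wald fit for a reduced design
dds_reduce <- nbinomWaldTest(dds_reduce, betaPrior = FALSE)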