Question: Deviance decomposition with DESeq2
0
11 months ago by
Italy, Florence, University of Florence
Giovanni Bacci0 wrote:

Hello,

I'm using DESeq2 to analyse a large dataset with basically three predictors:

1 - Subjects = the ID of the subject involved in the experiment (categorical)

2 - Culture medium = the culture medium (categorical)

3 - Time = the number of days (continuous)

I would like to inspect the proportion of variance explained by the three predictors for each gene to categorize them based on the most influencing factor (this is only an exploratory analysis). Following the DESeq2 manual, the paper, and this post, I wrote this little script but I'm not sure if it is correct:

models <- list(
all = "~ medium + subject + days",
no.food = "~ subject + days",
no.subject = "~ days",
null = "~ 1"
)

deviances <- lapply(models, function(m){
design(dds) <- as.formula(m)
dds <- DESeq(dds, test = "Wald", fitType = "parametric")
mcols(dds)$deviance }) residual <- deviances[[length(deviances)]] deviances <- lapply(seq_along(deviances)[-1], function(i){ n <- i - 1 abs(deviances[[i]] - deviances[[n]]) }) deviances <- data.frame(do.call(cbind, deviances)) deviances$residual <- residual - rowSums(deviances)
deviances <- data.frame(t(apply(deviances, 1, function(x) x / sum(x))))
deviances <- setNames(deviances, c("medium", "subject", "days", "residual"))

Probably I'm doing something wrong since I ended up with negative residuals for some genes ...

cheers

Giovanni

sessionInfo():

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=it_IT.UTF-8
[4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8
[10] LC_TELEPHONE=C             LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  grid      stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ggtern_2.2.1               vegan_2.4-6                lattice_0.20-35
[4] permute_0.9-4              ggsci_2.8                  scales_0.5.0.9000
[7] bindrcpp_0.2               dplyr_0.7.4                ape_5.0
[10] ggthemes_3.4.0             DESeq2_1.18.1              SummarizedExperiment_1.8.1
[13] DelayedArray_0.4.1         matrixStats_0.53.1         GenomicRanges_1.30.2
[16] GenomeInfoDb_1.14.0        IRanges_2.12.0             S4Vectors_0.16.0
[19] phyloseq_1.23.1            metagenomeSeq_1.20.1       RColorBrewer_1.1-2
[22] glmnet_2.0-13              foreach_1.4.4              Matrix_1.2-11
[25] limma_3.34.9               Biobase_2.38.0             BiocGenerics_0.24.0
[28] gridExtra_2.3              ggplot2_2.2.1

loaded via a namespace (and not attached):
[1] nlme_3.1-131.1         bitops_1.0-6           bit64_0.9-7            latex2exp_0.4.0
[5] tensorA_0.36           tools_3.4.4            backports_1.1.2        utf8_1.1.3
[9] R6_2.2.2               rpart_4.1-13           KernSmooth_2.23-15     Hmisc_4.1-1
[13] DBI_0.7                lazyeval_0.2.1         mgcv_1.8-23            colorspace_1.3-2
[21] compiler_3.4.4         compositions_1.40-1    cli_1.0.0              htmlTable_1.11.2
[25] labeling_0.3           caTools_1.17.1         checkmate_1.8.5        DEoptimR_1.0-8
[29] robustbase_0.92-8      genefilter_1.60.0      stringr_1.3.0          digest_0.6.15
[33] foreign_0.8-69         XVector_0.18.0         base64enc_0.1-3        pkgconfig_2.0.1
[37] htmltools_0.3.6        htmlwidgets_1.0        rlang_0.2.0            rstudioapi_0.7
[41] RSQLite_2.0            bindr_0.1              energy_1.7-2           jsonlite_1.5
[45] BiocParallel_1.12.0    gtools_3.5.0           acepack_1.4.1          RCurl_1.95-4.10
[49] magrittr_1.5           GenomeInfoDbData_1.0.0 Formula_1.2-2          biomformat_1.6.0
[53] Rcpp_0.12.15           munsell_0.4.3          proto_1.0.0            stringi_1.1.6
[57] yaml_2.1.16            MASS_7.3-49            zlibbioc_1.24.0        rhdf5_2.22.0
[61] gplots_3.0.1           plyr_1.8.4             blob_1.1.0             gdata_2.18.0
[65] crayon_1.3.4           Biostrings_2.46.0      splines_3.4.4          multtest_2.34.0
[69] annotate_1.56.1        locfit_1.5-9.1         knitr_1.20             pillar_1.1.0
[73] igraph_1.1.2           boot_1.3-20            geneplotter_1.56.0     reshape2_1.4.3
[77] codetools_0.2-15       glue_1.2.0             XML_3.98-1.10          latticeExtra_0.6-28
[81] data.table_1.10.4-3    gtable_0.2.0           assertthat_0.2.0       xtable_1.8-2
[85] survival_2.41-3        tibble_1.4.2           iterators_1.0.9        AnnotationDbi_1.40.0
[89] memoise_1.1.0          cluster_2.0.6         

modified 11 months ago by Michael Love22k • written 11 months ago by Giovanni Bacci0
0
11 months ago by
Michael Love22k
United States
Michael Love22k wrote:

The deviance I am returning in mcols(dds) is -2 * log likelihood.

The LRT statistic I return is 2 * (full log likelihood - reduced log likelihood) = reduced deviance - full deviance, which is assumed to follow a chi square distribution. Since the log likelihood increases when coefficients are added to the model, and deviance as defined above has the opposite sign, it's expected that reduced deviance > full deviance. You should not put an absolute value in the difference calculation above.

Sorry if I come back to this after quite a long time, but I'm still struggling to produce reliable results. It seems that the above calculation produced some negative values. I would say that the factor considered is somehow worsening the model by increasing the deviance of the full model in respect with the reduced one but I don't know how to deal with it.  Should I treat those values as zeros or should I leave them as negative values?

Thanks again!

Giovanni

I would suggest using the dispersion from the full model. If you run separate models, you are re-estimating dispersion each time. You can run estimateDispersion(dds) for the full model and then for the reduced models, you can set dispersion with dispersions(dds_reduce) <- dispersions(dds_full). Then you only need to re-run the nbinomWald() function, not the whole procedure.

Dear Michael,
thanks for your help! I changed my code as follows:

# Full model
dds.full <- DESeq(dds.full, test = "Wald", fitType = "parametric", betaPrior = T)

# Deviance estimation of reduced models
deviances <- sapply(models[-1], function(m){
dds <- dds.full # Copying the full model
design(dds) <- as.formula(m) # Changing design
dds <- nbinomWaldTest(dds, betaPrior = T) # Re-run wald test
return(mcols(dds)$deviance) # Returning deviances }) deviances <- cbind(mcols(dds.full)$deviance, deviances) # Adding deviances of the full model

This way I need only to change the design of my experiment leaving everything else as already estimated in the full model . Then, to build the full deviance table:

deviances <- data.frame(medium = deviances[,2] - deviances[,1],
food = deviances[,3] - deviances[,2],
days = deviances[,4] - deviances[,3]) %>%
mutate(residual = deviances[,4] - (medium + food + days)) %>%
as.matrix() %>%
prop.table(margin = 1) %>%
data.frame(row.names = rownames(dds.full))

This seems to work, now I get only positive values and they seem reasonable. Does it sound good to you?

Thaks again,

ciao,

Giovanni