Hi,
I have been using DESeq2 together with weights obtained from the zinbwave package for shotgun data analysis (as proposed here in the case of single-cell applications: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1406-4).
In DESeq2, we can specify contrasts among other things either by a character vector (e.g., contrast = c("condition","B","A")
) or by a numeric vector with one element for each element in resultsNames(object)
. If my understanding is correct, in the first case, the contrast is computed using the already computed results, and in the second case, some elements are recomputed.
My issue is that I am getting different results whether I use the character vector or the numeric vector, even if specifying the exact same contrast. And I only have this issue when I use zinbwave weights. So, I suspect that weights are handled differently in the main DESeq2 analysis, and in the recomputation performed when we specify contrasts as a numeric vector. However, I do not know which one is the correct result.
Below is a very simple example using a phylum-level dataset from the curatedMetagenomicData R package (phyloseq object can be found here as I changed a few things in the metadata: Example data set).
I obtain a few major differences on lfcSE and thus on p-values. Did anyone encounter this issue or have an idea on which contrast version would provide the correct results? I am using R 4.4.0 and not the newest DESeq2 version (DESeq2_1.44.0).
# Import libraries
library(phyloseq)
library(dplyr)
library(DESeq2)
library(zinbwave)
# Keep only TF samples and exclude null taxa
data <- subset_samples(phseq_bengtsson_phy, time=="TF")
data <- subset_taxa(data, rowSums(otu_table(data))>0)
# DESeq2 - zinbwave run
## Create DESeq object with weights
dds <- phyloseq_to_deseq2(data, design = ~ 1)
zinb <- zinbwave(dds, K = 0, observationalWeights = TRUE,
BPPARAM = BiocParallel::SerialParam(), epsilon = 1e12)
dds_zinbwave <- DESeqDataSet(zinb, design = ~travel_destination)
## Perform DESeq2 analysis
mod <- DESeq(dds_zinbwave, test="Wald", sfType="poscounts")
resultsNames(mod)
# [1] "Intercept" "travel_destination_Indian_Peninsula_vs_Central_Africa"
## Using the contrast option as a character vector
res_zinb_1 <- data.frame(results(mod, contrast=c("travel_destination","Indian_Peninsula","Central_Africa"), alpha=0.05))
## Using the contrast option as a numeric vector
contraster <- c(0,1)
names(contraster) <- resultsNames(mod)
contraster
# Intercept travel_destination_Indian_Peninsula_vs_Central_Africa
# 0 1
res_zinb_2 <- data.frame(DESeq2::results(mod, contrast = contraster, alpha = 0.05))
## Comparison
res_zinb_1-res_zinb_2
# baseMean log2FoldChange lfcSE stat pvalue padj
# Firmicutes 0 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4.828636e-02
# Bacteroidota 0 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 -1.605324e-02
# Verrucomicrobia 0 0.000000e+00 2.220446e-16 -4.440892e-16 2.081668e-17 6.938894e-17
# Lentisphaerae 0 -4.440892e-16 0.000000e+00 -8.881784e-16 -1.084202e-19 -5.421011e-19
# Proteobacteria 0 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
# Candidatus Melainabacteria 0 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 NA
# Actinobacteria 0 0.000000e+00 -4.851477e+01 -3.347592e-01 -2.620144e-01 -1.803764e-01
# Euryarchaeota 0 1.110223e-16 4.815127e-01 -7.162905e-01 2.601332e-01 3.117906e-01
# Synergistetes 0 0.000000e+00 -1.235062e-01 -2.520104e-01 -5.553688e-02 -6.484741e-02
# Spirochaetes 0 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 NA
# Ascomycota 0 0.000000e+00 6.351663e-01 9.365578e+00 8.224797e-64 8.224797e-63
# Fusobacteria 0 0.000000e+00 -1.061564e+00 1.622202e-01 -1.244158e-01 -1.803764e-01
sessionInfo( )
# R version 4.4.0 (2024-04-24 ucrt)
# Platform: x86_64-w64-mingw32/x64
# Running under: Windows 11 x64 (build 22631)
#
# Matrix products: default
#
#
# locale:
# [1] LC_COLLATE=French_France.utf8 LC_CTYPE=French_France.utf8 LC_MONETARY=French_France.utf8 LC_NUMERIC=C LC_TIME=French_France.utf8
#
# time zone: Europe/Paris
# tzcode source: internal
#
# attached base packages:
# [1] stats4 stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] zinbwave_1.26.0 SingleCellExperiment_1.26.0 DESeq2_1.44.0 SummarizedExperiment_1.34.0 Biobase_2.64.0 MatrixGenerics_1.16.0
# [7] matrixStats_1.3.0 GenomicRanges_1.56.1 GenomeInfoDb_1.40.1 IRanges_2.38.1 S4Vectors_0.42.1 BiocGenerics_0.50.0
# [13] dplyr_1.1.4 phyloseq_1.48.0
#
# loaded via a namespace (and not attached):
# [1] DBI_1.2.3 permute_0.9-7 rlang_1.1.4 magrittr_2.0.3 ade4_1.7-22 compiler_4.4.0 RSQLite_2.3.7
# [8] mgcv_1.9-1 png_0.1-8 vctrs_0.6.5 reshape2_1.4.4 stringr_1.5.1 pkgconfig_2.0.3 crayon_1.5.3
# [15] fastmap_1.2.0 XVector_0.44.0 utf8_1.2.4 UCSC.utils_1.0.0 purrr_1.0.2 bit_4.0.5 zlibbioc_1.50.0
# [22] cachem_1.1.0 jsonlite_1.8.8 biomformat_1.32.0 blob_1.2.4 rhdf5filters_1.16.0 DelayedArray_0.30.1 Rhdf5lib_1.26.0
# [29] BiocParallel_1.38.0 parallel_4.4.0 cluster_2.1.6 R6_2.5.1 stringi_1.8.4 limma_3.60.4 genefilter_1.86.0
# [36] Rcpp_1.0.13 iterators_1.0.14 Matrix_1.7-0 splines_4.4.0 igraph_2.0.3 tidyselect_1.2.1 rstudioapi_0.16.0
# [43] abind_1.4-5 vegan_2.6-8 doParallel_1.0.17 codetools_0.2-20 lattice_0.22-6 tibble_3.2.1 plyr_1.8.9
# [50] KEGGREST_1.44.1 survival_3.7-0 Biostrings_2.72.1 pillar_1.9.0 foreach_1.5.2 softImpute_1.4-1 generics_0.1.3
# [57] ggplot2_3.5.1 munsell_0.5.1 scales_1.3.0 xtable_1.8-4 glue_1.7.0 tools_4.4.0 data.table_1.16.0
# [64] annotate_1.82.0 locfit_1.5-9.10 XML_3.99-0.17 rhdf5_2.48.0 grid_4.4.0 tidyr_1.3.1 ape_5.8
# [71] AnnotationDbi_1.66.0 edgeR_4.2.1 colorspace_2.1-1 nlme_3.1-166 GenomeInfoDbData_1.2.12 cli_3.6.3 fansi_1.0.6
# [78] S4Arrays_1.4.1 gtable_0.3.5 digest_0.6.37 SparseArray_1.4.8 memoise_2.0.1 multtest_2.60.0 lifecycle_1.0.4
# [85] httr_1.4.7 statmod_1.5.0 bit64_4.0.5 MASS_7.3-61
Thank you Michael Love for your answer. I am not sure this is the cause for what I see here. In this example, I am only comparing 2 groups, and I have filtered out variables which were null for all samples. The biggest differences are observed on lfcSE rather than fold changes (e.g. Actinobacteria). This must be related somhow to the use of weights because I do not have these differences when using DESeq2 without weights.
I'm not sure, because I can see in
results()
that it is again using the same weights matrix when it re-computes a contrast.Do you say that the lfcSE are mostly the same? How many are basically the same and how many are different?
Looking at the above, check if the discrepant cases in terms of lfcSE have LFC=0 when using the character contrast. This is expected behavior. And it can happen here I think if you have zeros for all the samples in a group excluding those samples with very low weight (essentially then you have all zeros for the observed data, positive for data that is downweighted to no effect on inference).
I have discrepancies of lfcse on 5 phyla out of 12. None of them had a logFC at 0. I tried to focus on 2 examples to better understand:
Fusobacteria
Results:
Character contrast:
Numeric contrast:
Actinobacteria
Results:
Character contrast:
Numeric contrast:
I have no idea why I got different results on this one.
I'll take a look.
Thanks for pointing this out! I found a bug that the wrong weights matrix was being used when recomputing the SE within results() for the numeric-style contrast. I'll fix this in devel / github and push a fix to the release branch as well.
I pushed 1.48.2 to release, or you can get the latest with
remotes::install_github("thelovelab/DESeq2")
Thank you so much!
Is it the case that you are running DESeq2 across 12 total features? This is a larger issue: I wouldn't use an NB model on such counts. The size factor estimation is surely not appropriately dealing with the compositional nature of the data.
Yes, I know. This was just to get a simple example ;-).
Ok good! Thanks for the example and helping me hunt down the bug.