ZINB-WaVE-DESeq2 with Bulk Data: Choice of Epsilon value, K and gene number
1
0
Entering edit mode
@rowcyclecamp-16750
Last seen 5 months ago
United Kingdom

Hello,

My questions are about using ZINB-WaVE.

I have written out the main steps I took to reach this point. If you don't have the time to read it, my questions are written under the heading below. Thank you!

# Questions

• How should I decide on a value for K when running zinbwave()? Will this impact observational weights when using DESeq2?
• I have so far tested K=2 and K=3, based on principal components analysis identifying 2 major sources of variation (shown below). There are 4 different biological conditions and 1 technical covariate, however, so should K be set to reflect this?
• Should I filter genes for the dimensionality reduction step?

• I have tested the top 1000 most variable genes and the entire set of 16000. If I use 16000 genes, I need a higher epsilon value (16000 instead of 1000) to achieve a similar looking representation.
• For calculating observational weights for DESeq2, I am guessing I will use my entire set of 16000 genes?
• How should I decide on a value for epsilon?

• I have tested epsilon values of 1000, 16000, 1e6, 1e12. The impact of changing epsilon seems to depend heavily on the choice of K and the number of genes used.

# Introduction

I am working with a bulk RNA-Seq dataset in which I used translating ribosome affinity purification to extract ribosome-bound RNA from a specific cell type in different regions of mouse brain. The library was prepared using an ultra low input protocol, as I could reliably obtain ~100 pg per sample (in some regions, this cell type is very rare). Each sample was sequenced to a depth of ~25 million reads. There are 120 samples in total, collected over 8 batches (every biological condition was collected in each batch).

I suspect there may be zero inflation in the count values of all samples, as I try to demonstrate below.

# Zero inflation: Calculating dropout rate

To check for zero-inflation, I did the following:

# dds is a DESeq2 object created from a salmon > tximeta > DESeq2 workflow

# filter genes with less than 5 counts in at least 10 samples
dds <- dds[rowSums(counts(dds) >= 5) >= 10,]

# check dimensions
dim(dds)
# ~16000 genes, 120 samples

## calculate dropout
# extract counts (as I could not set 0 values to NA in the DESeq object)
cts <- counts(dds)
# set 0 values to NA
cts[cts == 0] <- NA
# calculate per gene mean count (excluding zero counts) and the proportion of zeros
dropouts <- tibble(mean = rowMeans(cts, na.rm = T),
dropouts = rowSums(is.na(cts))/ncol(cts))

# quantile of dropouts
quantile(dropouts$dropouts, seq(0.1, 1, 0.1)) # 10% 20% 30% 40% 50% 60% 70% #0.008333333 0.075000000 0.183333333 0.300000000 0.416666667 0.533333333 0.650000000 # 80% 90% 100% #0.750000000 0.841666667 0.916666667 # plot mean vs dropouts dropouts %>% ggplot(aes(x = log10(mean), y = dropouts)) + geom_point(alpha = 0.1)  For comparison, below is the same plot for another dataset in which I used Smart-Seq2 with a larger input amount of RNA of the same type of samples: I then went on to look at the dispersion estimates. # Dispersion # calculate size factors and dispersion dds <- estimateSizeFactors(dds, "poscounts") dds <- estimateDispersions(dds, fitType = "local") # plot dispersion estimates plotDispEsts(dds)  My interpretation of this plot is that the dispersion of lower count genes (<1000 counts) is large; large enough to weaken my ability to detect differentially expressed genes within this count range. Is this correct? # PCA I performed principal components analysis to look at how the samples differ. I used the top 1000 most variable genes. I expected 1 major technical and 1 major biological sample covariate to cause separation and this was confirmed. There is a large drop in variance explained by subsequent PC's, although they may represent variation due to other biological conditions tested (not shown). # subsetting 1000 most variable genes vars <- counts(dds) %>% rowVars names(vars) <- rownames(dds) vars <- sort(vars, decreasing = TRUE) zidds <- dds[names(vars)[1:1000],] # recalculate size factors and dispersion - is this necessary? zidds <- estimateSizeFactors(zidds, "poscounts") zidds <- estimateDispersions(zidds, fitType = "local") # variance stabilising transformation vsd <- vst(zidds) # make pea object pca <- prcomp(t(assay(vsd))) # add colData pca$x <- pca$x %>% as_tibble(rownames = identifier) %>% inner_join(colData(zidds), copy = TRUE) # plot PC1 vs PC2 pca$x %>%
ggplot(aes(x = PC1,
y = PC2,
colour = biological,
shape = technical)) +
geom_point()


# calculate the variation per component
var_explained0 <- pca$sdev^2/sum(pca$sdev^2) #Calculate PC variance
names(var_explained0) <- colnames(pca$x) var_explained <- as_tibble(var_explained0, rownames = "PC") %>% mutate(var = paste0((round(var_explained0*100)), ' %', sep = '')) %>% select(PC, var) pca$var_explained <- var_explained

# scree splot
pca$var_explained %>% slice_head(n=10) %>% mutate(PC = factor(PC, levels = c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9", "PC10")), var = as.integer(str_extract(var, pattern = "[:digit:]+"))) %>% ggplot(aes(x = PC, y = var)) + geom_col()  # ZINB-WaVE I have been following the ZINB-WaVE vignette and the deseq2-zinbwave workflow to try to account for zero inflation. I first want to use ZINB-WaVE's dimensionality reduction method to see how adjusting for dropout influences the low dimension representation. Below is an example of my working. # ZINB dimensionality reduction with 1000 most variable genes, K = 2, epsilon default zinb <- zinbwave(zidds, K = 2) W <- reducedDim(zinb) as_tibble(W, rownames = "sample_name") %>% inner_join(colData(zinb), copy = TRUE) %>% ggplot(aes(W1, W2, colour=biological, shape=technical)) + geom_point()   sessionInfo( ) R version 4.0.3 (2020-10-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.5 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1 locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 grid stats graphics grDevices datasets [8] utils methods base other attached packages: [1] edgeR_3.32.1 limma_3.46.0 [3] zinbwave_1.12.0 SingleCellExperiment_1.12.0 [5] apeglm_1.12.0 biomaRt_2.46.1 [7] BiocParallel_1.24.1 org.Mm.eg.db_3.12.0 [9] GenomicFeatures_1.42.1 AnnotationDbi_1.52.0 [11] rhdf5_2.34.0 tximport_1.18.0 [13] DESeq2_1.30.0 SummarizedExperiment_1.20.0 [15] Biobase_2.50.0 MatrixGenerics_1.2.0 [17] matrixStats_0.57.0 GenomicRanges_1.42.0 [19] GenomeInfoDb_1.26.2 IRanges_2.24.1 [21] S4Vectors_0.28.1 BiocGenerics_0.36.0 [23] tximeta_1.8.3 broom_0.7.3 [25] Rtsne_0.15 factoextra_1.0.7 [27] gwasrapidd_0.99.11 ggvenn_0.1.8 [29] msigdbr_7.2.1 gprofiler2_0.2.0 [31] scales_1.1.1 ggrepel_0.9.1 [33] openxlsx_4.2.3 readxl_1.3.1 [35] kableExtra_1.3.1 ggthemes_4.2.4 [37] car_3.0-10 carData_3.0-4 [39] ggpubr_0.4.0 ggsci_2.9 [41] cowplot_1.1.1 pander_0.6.3 [43] forcats_0.5.0 stringr_1.4.0 [45] dplyr_1.0.3 purrr_0.3.4 [47] readr_1.4.0 tidyr_1.1.2 [49] tibble_3.0.5 ggplot2_3.3.3 [51] tidyverse_1.3.0 devtools_2.3.2 [53] usethis_2.0.0 loaded via a namespace (and not attached): [1] rappdirs_0.3.1 rtracklayer_1.50.0 [3] coda_0.19-4 bit64_4.0.5 [5] knitr_1.30 DelayedArray_0.16.1 [7] hwriter_1.3.2 data.table_1.13.6 [9] RCurl_1.98-1.2 AnnotationFilter_1.14.0 [11] generics_0.1.0 callr_3.5.1 [13] RSQLite_2.2.3 bit_4.0.4 [15] webshot_0.5.2 xml2_1.3.2 [17] lubridate_1.7.9.2 httpuv_1.5.5 [19] assertthat_0.2.1 xfun_0.20 [21] hms_1.0.0 evaluate_0.14 [23] promises_1.1.1 fansi_0.4.2 [25] progress_1.2.2 dbplyr_2.0.0 [27] DBI_1.1.1 geneplotter_1.68.0 [29] htmlwidgets_1.5.3 ellipsis_0.3.1 [31] backports_1.2.1 FactoMineR_2.4 [33] bookdown_0.21 annotate_1.68.0 [35] vctrs_0.3.6 remotes_2.2.0 [37] ensembldb_2.14.0 abind_1.4-5 [39] cachem_1.0.1 withr_2.4.1 [41] bdsmatrix_1.3-4 GenomicAlignments_1.26.0 [43] prettyunits_1.1.1 softImpute_1.4 [45] cluster_2.1.0 lazyeval_0.2.2 [47] crayon_1.3.4 genefilter_1.72.1 [49] pkgconfig_2.0.3 labeling_0.4.2 [51] pkgload_1.1.0 ProtGenerics_1.22.0 [53] rlang_0.4.10 lifecycle_0.2.0 [55] BiocFileCache_1.14.0 modelr_0.1.8 [57] AnnotationHub_2.22.0 cellranger_1.1.0 [59] rprojroot_2.0.2 Matrix_1.3-2 [61] Rhdf5lib_1.12.0 zoo_1.8-8 [63] reprex_1.0.0 processx_3.4.5 [65] png_0.1-7 viridisLite_0.3.0 [67] bitops_1.0-6 rhdf5filters_1.2.0 [69] Biostrings_2.58.0 blob_1.2.1 [71] ShortRead_1.48.0 jpeg_0.1-8.1 [73] rstatix_0.6.0 ggsignif_0.6.0 [75] leaps_3.1 memoise_2.0.0 [77] magrittr_2.0.1 plyr_1.8.6 [79] zlibbioc_1.36.0 compiler_4.0.3 [81] tinytex_0.29 bbmle_1.0.23.1 [83] RColorBrewer_1.1-2 Rsamtools_2.6.0 [85] cli_2.2.0 XVector_0.30.0 [87] ps_1.5.0 MASS_7.3-53 [89] tidyselect_1.1.0 stringi_1.5.3 [91] highr_0.8 emdbook_1.3.12 [93] yaml_2.2.1 askpass_1.1 [95] locfit_1.5-9.4 latticeExtra_0.6-29 [97] tools_4.0.3 rio_0.5.16 [99] rstudioapi_0.13 foreign_0.8-81 [101] scatterplot3d_0.3-41 farver_2.0.3 [103] digest_0.6.27 BiocManager_1.30.10 [105] shiny_1.6.0 Rcpp_1.0.6 [107] BiocVersion_3.12.0 later_1.1.0.1 [109] ngsReports_1.6.1 httr_1.4.2 [111] ggdendro_0.1.22 colorspace_2.0-0 [113] rvest_0.3.6 XML_3.99-0.5 [115] fs_1.5.0 splines_4.0.3 [117] renv_0.12.5 plotly_4.9.3 [119] sessioninfo_1.1.1 xtable_1.8-4 [121] jsonlite_1.7.2 flashClust_1.01-2 [123] testthat_3.0.1 R6_2.5.0 [125] pillar_1.4.7 htmltools_0.5.1.1 [127] mime_0.9 glue_1.4.2 [129] fastmap_1.1.0 DT_0.17 [131] interactiveDisplayBase_1.28.0 pkgbuild_1.2.0 [133] mvtnorm_1.1-1 lattice_0.20-41 [135] numDeriv_2016.8-1.1 curl_4.3 [137] zip_2.1.1 openssl_1.4.3 [139] survival_3.2-7 rmarkdown_2.6 [141] desc_1.2.0 munsell_0.5.0 [143] GenomeInfoDbData_1.2.4 haven_2.3.1 [145] reshape2_1.4.4 gtable_0.3.0  DESeq2 ZINB zinbwave deseq • 398 views ADD COMMENT 0 Entering edit mode davide risso ▴ 870 @davide-risso-5075 Last seen 5 months ago University of Padova Hi, I will respond to the ZINB-WaVE specific questions, but I will let Mike or other DESeq2 experts comment on the dispersion plot, as I'm not sure how "unusual" that look and if it justifies the use of weights. A quick aside: I guess that a missing piece to further comment on that is, how do the dispersion estimates change once you take zero inflation into account with zinbwave weights? Do the weights lead to a better plot? Now onto the questions: • How should I decide on a value for K when running zinbwave()? Will this impact observational weights when using DESeq2? There is no set-in-stone rule to choose K for zinbwave.What we suggest is to look at the factors and see if they explain interesting variation in the data or not. We do have however a couple of functions that can help you choose K, namely zinbAIC and zinbBIC, which will compute the Akaike and Baysian Information Criteria, respectively. Note that these need to be applied to the output of zinbFit rather than zinbwave. The smaller AIC and BIC the better, you could try several values of K and choose the one that minimizes AIC and/or BIC. The value of K enters into the weights calculation only through$\mu\$, so we generally do not expect that it will influence them much. It is important though to highlight that these are two different usages of zinbwave, namely we usually use the K factors for unsupervised analyses when we do not know a priori the sample groups, while we use the observational weights for differential analysis in supervised problems (where we do know the groups). In the latter case, it is better to directly specify the groups in the X design matrix rather than letting zinbwave estimate the factors in W.

• Should I filter genes for the dimensionality reduction step?

We usually recommend this because it is both computationally faster and more stable, e.g., many lowly expressed genes may only contribute noise and not much signal. Of course, you want to use all genes when computing the observational weights because you need a vector of weights for each gene that enters the DESeq2 analysis.

• How should I decide on a value for epsilon?

This is probably the less obvious choice in the zinbwave package. Generally, high values are better than low values as epsilon is used to stabilize the estimates. Your observation is correct, the "optimal" value of epsilon depends on both the number of genes and factors, because as the number of parameters (that depend on the number of genes, samples, covariates, and factors) increases we need a higher value of epsilon.

I hope this is useful to clarify at least some of your questions. Please, let me know if I was unclear and I will try to give more details.

Best, Davide

0
Entering edit mode

Thank you for your quick and detailed response. Really helpful!

• To answer your first question about how the dispersion estimates change with zero inflation taken into account: It looks like it makes a big difference. The maximum dispersion estimates are around 5, as opposed to 50 earlier:

• I ran zinbAIC and zinbBIC to help decide on a value for K. AIC keeps falling with increasing K, but BIC reaches a minimum around 3-4, so I've decided to stick to 4 as a compromise. I will need to do some reading about these criteria to better understand them!

• Thanks for the clarification on gene filtering. Certainly when including all genes for dimensionality reduction, it takes an age and some samples appear to drift away from the majority, even though nothing in the initial QC flagged them as potential outliers.

• Could I ask a follow-up question on this? My data were collected over a series of batches, and a subset of those batches show some separation when plotting latent factors. Rather than including the batch identifier in my DESeq2 design, would I be better off including the vector of the latent factor that demonstrates this batch difference?
• For epsilon, I have compared dispersion estimates for the default option, 1e6 and 1e12. I will probably go with 1e6, based on your advice to increase epsilon:

# 1e12:

Thank you again!

0
Entering edit mode

Yes, I think your choices of K and epsilon make sense based on the plots. And it seems that the weights are indeed beneficial!

As for your follow-up question, my guess is that including the batch variable directly in the model should be preferable since you have that information. I would use the latent factors only in case the batch IDs were not available or somewhat not trustworthy.

0
Entering edit mode

Interesting, and good to know, thank you.

0
Entering edit mode

Agree with Davide's assessment above.