Questions

Question

ZINB-WaVE-DESeq2 with Bulk Data: Choice of Epsilon value, K and gene number

0

Entering edit mode

rowcyclecamp • 0

@rowcyclecamp-16750

Last seen 3.7 years ago

United Kingdom

Hello,

My questions are about using ZINB-WaVE.

I have written out the main steps I took to reach this point. If you don't have the time to read it, my questions are written under the heading below. Thank you!

Questions

How should I decide on a value for K when running zinbwave()? Will this impact observational weights when using DESeq2?
- I have so far tested K=2 and K=3, based on principal components analysis identifying 2 major sources of variation (shown below). There are 4 different biological conditions and 1 technical covariate, however, so should K be set to reflect this?

Should I filter genes for the dimensionality reduction step?
- I have tested the top 1000 most variable genes and the entire set of 16000. If I use 16000 genes, I need a higher epsilon value (16000 instead of 1000) to achieve a similar looking representation.
- For calculating observational weights for DESeq2, I am guessing I will use my entire set of 16000 genes?
How should I decide on a value for epsilon?
- I have tested epsilon values of 1000, 16000, 1e6, 1e12. The impact of changing epsilon seems to depend heavily on the choice of K and the number of genes used.

Introduction

I am working with a bulk RNA-Seq dataset in which I used translating ribosome affinity purification to extract ribosome-bound RNA from a specific cell type in different regions of mouse brain. The library was prepared using an ultra low input protocol, as I could reliably obtain ~100 pg per sample (in some regions, this cell type is very rare). Each sample was sequenced to a depth of ~25 million reads. There are 120 samples in total, collected over 8 batches (every biological condition was collected in each batch).

I suspect there may be zero inflation in the count values of all samples, as I try to demonstrate below.

Zero inflation: Calculating dropout rate

To check for zero-inflation, I did the following:

# dds is a DESeq2 object created from a salmon > tximeta > DESeq2 workflow

# filter genes with less than 5 counts in at least 10 samples
dds <- dds[rowSums(counts(dds) >= 5) >= 10,]

# check dimensions
dim(dds)
# ~16000 genes, 120 samples

## calculate dropout
# extract counts (as I could not set 0 values to NA in the DESeq object)
cts <- counts(dds)
# set 0 values to NA
cts[cts == 0] <- NA
# calculate per gene mean count (excluding zero counts) and the proportion of zeros
dropouts <- tibble(mean = rowMeans(cts, na.rm = T), 
       dropouts = rowSums(is.na(cts))/ncol(cts))

# quantile of dropouts
quantile(dropouts$dropouts, seq(0.1, 1, 0.1))
#        10%         20%         30%         40%         50%         60%         70% 
#0.008333333 0.075000000 0.183333333 0.300000000 0.416666667 0.533333333 0.650000000 
#        80%         90%        100% 
#0.750000000 0.841666667 0.916666667 

# plot mean vs dropouts
dropouts %>%
  ggplot(aes(x = log10(mean), 
             y = dropouts)) +
  geom_point(alpha = 0.1)

mean_vs_dropout_ulinput

For comparison, below is the same plot for another dataset in which I used Smart-Seq2 with a larger input amount of RNA of the same type of samples:

mean_vs_dropout_ss2

I then went on to look at the dispersion estimates.

Dispersion

# calculate size factors and dispersion
dds <- estimateSizeFactors(dds, "poscounts")
dds <- estimateDispersions(dds, fitType = "local")

# plot dispersion estimates
plotDispEsts(dds)

dispests_ulinput

My interpretation of this plot is that the dispersion of lower count genes (<1000 counts) is large; large enough to weaken my ability to detect differentially expressed genes within this count range. Is this correct?

PCA

I performed principal components analysis to look at how the samples differ. I used the top 1000 most variable genes. I expected 1 major technical and 1 major biological sample covariate to cause separation and this was confirmed. There is a large drop in variance explained by subsequent PC's, although they may represent variation due to other biological conditions tested (not shown).

# subsetting 1000 most variable genes
vars <- counts(dds) %>% rowVars
names(vars) <- rownames(dds)
vars <- sort(vars, decreasing = TRUE)
zidds <- dds[names(vars)[1:1000],]

# recalculate size factors and dispersion - is this necessary?
zidds <- estimateSizeFactors(zidds, "poscounts")
zidds <- estimateDispersions(zidds, fitType = "local")

# variance stabilising transformation
vsd <- vst(zidds)

# make pea object
pca <- prcomp(t(assay(vsd)))
# add colData
pca$x <- pca$x %>%
      as_tibble(rownames = identifier) %>%
      inner_join(colData(zidds), copy = TRUE)

# plot PC1 vs PC2
pca$x %>%
  ggplot(aes(x = PC1,
             y = PC2,
             colour = biological, 
             shape = technical)) +
  geom_point()

PCA

# calculate the variation per component
var_explained0 <- pca$sdev^2/sum(pca$sdev^2) #Calculate PC variance
names(var_explained0) <- colnames(pca$x)
var_explained <- as_tibble(var_explained0,
                            rownames = "PC") %>%
mutate(var = paste0((round(var_explained0*100)), ' %', sep = '')) %>%
select(PC, var)

pca$var_explained <- var_explained

# scree splot
pca$var_explained %>%
  slice_head(n=10) %>%
  mutate(PC = factor(PC, levels = c("PC1", "PC2", "PC3", "PC4", "PC5", 
                                    "PC6", "PC7", "PC8", "PC9", "PC10")), 
         var = as.integer(str_extract(var, pattern = "[:digit:]+"))) %>%
  ggplot(aes(x = PC, 
             y = var)) +
  geom_col()

Scree plot

ZINB-WaVE

I have been following the ZINB-WaVE vignette and the deseq2-zinbwave workflow to try to account for zero inflation.

I first want to use ZINB-WaVE's dimensionality reduction method to see how adjusting for dropout influences the low dimension representation. Below is an example of my working.

# ZINB dimensionality reduction with 1000 most variable genes, K = 2, epsilon default

zinb <- zinbwave(zidds, 
                K = 2)

W <- reducedDim(zinb)

as_tibble(W, rownames = "sample_name") %>%
  inner_join(colData(zinb), copy = TRUE) %>%
    ggplot(aes(W1, 
                      W2, 
                      colour=biological, 
                      shape=technical)) + 
    geom_point()

zinbwave


sessionInfo( )

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices datasets 
 [8] utils     methods   base     

other attached packages:
 [1] edgeR_3.32.1                limma_3.46.0               
 [3] zinbwave_1.12.0             SingleCellExperiment_1.12.0
 [5] apeglm_1.12.0               biomaRt_2.46.1             
 [7] BiocParallel_1.24.1         org.Mm.eg.db_3.12.0        
 [9] GenomicFeatures_1.42.1      AnnotationDbi_1.52.0       
[11] rhdf5_2.34.0                tximport_1.18.0            
[13] DESeq2_1.30.0               SummarizedExperiment_1.20.0
[15] Biobase_2.50.0              MatrixGenerics_1.2.0       
[17] matrixStats_0.57.0          GenomicRanges_1.42.0       
[19] GenomeInfoDb_1.26.2         IRanges_2.24.1             
[21] S4Vectors_0.28.1            BiocGenerics_0.36.0        
[23] tximeta_1.8.3               broom_0.7.3                
[25] Rtsne_0.15                  factoextra_1.0.7           
[27] gwasrapidd_0.99.11          ggvenn_0.1.8               
[29] msigdbr_7.2.1               gprofiler2_0.2.0           
[31] scales_1.1.1                ggrepel_0.9.1              
[33] openxlsx_4.2.3              readxl_1.3.1               
[35] kableExtra_1.3.1            ggthemes_4.2.4             
[37] car_3.0-10                  carData_3.0-4              
[39] ggpubr_0.4.0                ggsci_2.9                  
[41] cowplot_1.1.1               pander_0.6.3               
[43] forcats_0.5.0               stringr_1.4.0              
[45] dplyr_1.0.3                 purrr_0.3.4                
[47] readr_1.4.0                 tidyr_1.1.2                
[49] tibble_3.0.5                ggplot2_3.3.3              
[51] tidyverse_1.3.0             devtools_2.3.2             
[53] usethis_2.0.0              

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.1                rtracklayer_1.50.0           
  [3] coda_0.19-4                   bit64_4.0.5                  
  [5] knitr_1.30                    DelayedArray_0.16.1          
  [7] hwriter_1.3.2                 data.table_1.13.6            
  [9] RCurl_1.98-1.2                AnnotationFilter_1.14.0      
 [11] generics_0.1.0                callr_3.5.1                  
 [13] RSQLite_2.2.3                 bit_4.0.4                    
 [15] webshot_0.5.2                 xml2_1.3.2                   
 [17] lubridate_1.7.9.2             httpuv_1.5.5                 
 [19] assertthat_0.2.1              xfun_0.20                    
 [21] hms_1.0.0                     evaluate_0.14                
 [23] promises_1.1.1                fansi_0.4.2                  
 [25] progress_1.2.2                dbplyr_2.0.0                 
 [27] DBI_1.1.1                     geneplotter_1.68.0           
 [29] htmlwidgets_1.5.3             ellipsis_0.3.1               
 [31] backports_1.2.1               FactoMineR_2.4               
 [33] bookdown_0.21                 annotate_1.68.0              
 [35] vctrs_0.3.6                   remotes_2.2.0                
 [37] ensembldb_2.14.0              abind_1.4-5                  
 [39] cachem_1.0.1                  withr_2.4.1                  
 [41] bdsmatrix_1.3-4               GenomicAlignments_1.26.0     
 [43] prettyunits_1.1.1             softImpute_1.4               
 [45] cluster_2.1.0                 lazyeval_0.2.2               
 [47] crayon_1.3.4                  genefilter_1.72.1            
 [49] pkgconfig_2.0.3               labeling_0.4.2               
 [51] pkgload_1.1.0                 ProtGenerics_1.22.0          
 [53] rlang_0.4.10                  lifecycle_0.2.0              
 [55] BiocFileCache_1.14.0          modelr_0.1.8                 
 [57] AnnotationHub_2.22.0          cellranger_1.1.0             
 [59] rprojroot_2.0.2               Matrix_1.3-2                 
 [61] Rhdf5lib_1.12.0               zoo_1.8-8                    
 [63] reprex_1.0.0                  processx_3.4.5               
 [65] png_0.1-7                     viridisLite_0.3.0            
 [67] bitops_1.0-6                  rhdf5filters_1.2.0           
 [69] Biostrings_2.58.0             blob_1.2.1                   
 [71] ShortRead_1.48.0              jpeg_0.1-8.1                 
 [73] rstatix_0.6.0                 ggsignif_0.6.0               
 [75] leaps_3.1                     memoise_2.0.0                
 [77] magrittr_2.0.1                plyr_1.8.6                   
 [79] zlibbioc_1.36.0               compiler_4.0.3               
 [81] tinytex_0.29                  bbmle_1.0.23.1               
 [83] RColorBrewer_1.1-2            Rsamtools_2.6.0              
 [85] cli_2.2.0                     XVector_0.30.0               
 [87] ps_1.5.0                      MASS_7.3-53                  
 [89] tidyselect_1.1.0              stringi_1.5.3                
 [91] highr_0.8                     emdbook_1.3.12               
 [93] yaml_2.2.1                    askpass_1.1                  
 [95] locfit_1.5-9.4                latticeExtra_0.6-29          
 [97] tools_4.0.3                   rio_0.5.16                   
 [99] rstudioapi_0.13               foreign_0.8-81               
[101] scatterplot3d_0.3-41          farver_2.0.3                 
[103] digest_0.6.27                 BiocManager_1.30.10          
[105] shiny_1.6.0                   Rcpp_1.0.6                   
[107] BiocVersion_3.12.0            later_1.1.0.1                
[109] ngsReports_1.6.1              httr_1.4.2                   
[111] ggdendro_0.1.22               colorspace_2.0-0             
[113] rvest_0.3.6                   XML_3.99-0.5                 
[115] fs_1.5.0                      splines_4.0.3                
[117] renv_0.12.5                   plotly_4.9.3                 
[119] sessioninfo_1.1.1             xtable_1.8-4                 
[121] jsonlite_1.7.2                flashClust_1.01-2            
[123] testthat_3.0.1                R6_2.5.0                     
[125] pillar_1.4.7                  htmltools_0.5.1.1            
[127] mime_0.9                      glue_1.4.2                   
[129] fastmap_1.1.0                 DT_0.17                      
[131] interactiveDisplayBase_1.28.0 pkgbuild_1.2.0               
[133] mvtnorm_1.1-1                 lattice_0.20-41              
[135] numDeriv_2016.8-1.1           curl_4.3                     
[137] zip_2.1.1                     openssl_1.4.3                
[139] survival_3.2-7                rmarkdown_2.6                
[141] desc_1.2.0                    munsell_0.5.0                
[143] GenomeInfoDbData_1.2.4        haven_2.3.1                  
[145] reshape2_1.4.4                gtable_0.3.0

DESeq2 ZINB zinbwave deseq • 2.1k views

ADD COMMENT • link updated 3.9 years ago by Michael Love 43k • written 3.9 years ago by rowcyclecamp • 0

score 0 · Answer 1 · 2021-02-04

Hi,

I will respond to the ZINB-WaVE specific questions, but I will let Mike or other DESeq2 experts comment on the dispersion plot, as I'm not sure how "unusual" that look and if it justifies the use of weights. A quick aside: I guess that a missing piece to further comment on that is, how do the dispersion estimates change once you take zero inflation into account with zinbwave weights? Do the weights lead to a better plot?

Now onto the questions:

How should I decide on a value for K when running zinbwave()? Will this impact observational weights when using DESeq2?

There is no set-in-stone rule to choose K for zinbwave.What we suggest is to look at the factors and see if they explain interesting variation in the data or not. We do have however a couple of functions that can help you choose K, namely zinbAIC and zinbBIC, which will compute the Akaike and Baysian Information Criteria, respectively. Note that these need to be applied to the output of zinbFit rather than zinbwave. The smaller AIC and BIC the better, you could try several values of K and choose the one that minimizes AIC and/or BIC.

The value of K enters into the weights calculation only through $\mu$, so we generally do not expect that it will influence them much. It is important though to highlight that these are two different usages of zinbwave, namely we usually use the K factors for unsupervised analyses when we do not know a priori the sample groups, while we use the observational weights for differential analysis in supervised problems (where we do know the groups). In the latter case, it is better to directly specify the groups in the X design matrix rather than letting zinbwave estimate the factors in W.

Should I filter genes for the dimensionality reduction step?

We usually recommend this because it is both computationally faster and more stable, e.g., many lowly expressed genes may only contribute noise and not much signal. Of course, you want to use all genes when computing the observational weights because you need a vector of weights for each gene that enters the DESeq2 analysis.

How should I decide on a value for epsilon?

This is probably the less obvious choice in the zinbwave package. Generally, high values are better than low values as epsilon is used to stabilize the estimates. Your observation is correct, the "optimal" value of epsilon depends on both the number of genes and factors, because as the number of parameters (that depend on the number of genes, samples, covariates, and factors) increases we need a higher value of epsilon.

I hope this is useful to clarify at least some of your questions. Please, let me know if I was unclear and I will try to give more details.

Best, Davide