ZINB-WaVE-DESeq2 with Bulk Data: Choice of Epsilon value, K and gene number
1
0
Entering edit mode
@rowcyclecamp-16750
Last seen 3.7 years ago
United Kingdom

Hello,

My questions are about using ZINB-WaVE.

I have written out the main steps I took to reach this point. If you don't have the time to read it, my questions are written under the heading below. Thank you!

Questions

  • How should I decide on a value for K when running zinbwave()? Will this impact observational weights when using DESeq2?
    • I have so far tested K=2 and K=3, based on principal components analysis identifying 2 major sources of variation (shown below). There are 4 different biological conditions and 1 technical covariate, however, so should K be set to reflect this?
  • Should I filter genes for the dimensionality reduction step?

    • I have tested the top 1000 most variable genes and the entire set of 16000. If I use 16000 genes, I need a higher epsilon value (16000 instead of 1000) to achieve a similar looking representation.
    • For calculating observational weights for DESeq2, I am guessing I will use my entire set of 16000 genes?
  • How should I decide on a value for epsilon?

    • I have tested epsilon values of 1000, 16000, 1e6, 1e12. The impact of changing epsilon seems to depend heavily on the choice of K and the number of genes used.

Introduction

I am working with a bulk RNA-Seq dataset in which I used translating ribosome affinity purification to extract ribosome-bound RNA from a specific cell type in different regions of mouse brain. The library was prepared using an ultra low input protocol, as I could reliably obtain ~100 pg per sample (in some regions, this cell type is very rare). Each sample was sequenced to a depth of ~25 million reads. There are 120 samples in total, collected over 8 batches (every biological condition was collected in each batch).

I suspect there may be zero inflation in the count values of all samples, as I try to demonstrate below.

Zero inflation: Calculating dropout rate

To check for zero-inflation, I did the following:

# dds is a DESeq2 object created from a salmon > tximeta > DESeq2 workflow

# filter genes with less than 5 counts in at least 10 samples
dds <- dds[rowSums(counts(dds) >= 5) >= 10,]

# check dimensions
dim(dds)
# ~16000 genes, 120 samples

## calculate dropout
# extract counts (as I could not set 0 values to NA in the DESeq object)
cts <- counts(dds)
# set 0 values to NA
cts[cts == 0] <- NA
# calculate per gene mean count (excluding zero counts) and the proportion of zeros
dropouts <- tibble(mean = rowMeans(cts, na.rm = T), 
       dropouts = rowSums(is.na(cts))/ncol(cts))

# quantile of dropouts
quantile(dropouts$dropouts, seq(0.1, 1, 0.1))
#        10%         20%         30%         40%         50%         60%         70% 
#0.008333333 0.075000000 0.183333333 0.300000000 0.416666667 0.533333333 0.650000000 
#        80%         90%        100% 
#0.750000000 0.841666667 0.916666667 

# plot mean vs dropouts
dropouts %>%
  ggplot(aes(x = log10(mean), 
             y = dropouts)) +
  geom_point(alpha = 0.1)

mean_vs_dropout_ulinput

For comparison, below is the same plot for another dataset in which I used Smart-Seq2 with a larger input amount of RNA of the same type of samples:

mean_vs_dropout_ss2

I then went on to look at the dispersion estimates.

Dispersion

# calculate size factors and dispersion
dds <- estimateSizeFactors(dds, "poscounts")
dds <- estimateDispersions(dds, fitType = "local")

# plot dispersion estimates
plotDispEsts(dds)

dispests_ulinput

My interpretation of this plot is that the dispersion of lower count genes (<1000 counts) is large; large enough to weaken my ability to detect differentially expressed genes within this count range. Is this correct?

PCA

I performed principal components analysis to look at how the samples differ. I used the top 1000 most variable genes. I expected 1 major technical and 1 major biological sample covariate to cause separation and this was confirmed. There is a large drop in variance explained by subsequent PC's, although they may represent variation due to other biological conditions tested (not shown).

# subsetting 1000 most variable genes
vars <- counts(dds) %>% rowVars
names(vars) <- rownames(dds)
vars <- sort(vars, decreasing = TRUE)
zidds <- dds[names(vars)[1:1000],]

# recalculate size factors and dispersion - is this necessary?
zidds <- estimateSizeFactors(zidds, "poscounts")
zidds <- estimateDispersions(zidds, fitType = "local")

# variance stabilising transformation
vsd <- vst(zidds)

# make pea object
pca <- prcomp(t(assay(vsd)))
# add colData
pca$x <- pca$x %>%
      as_tibble(rownames = identifier) %>%
      inner_join(colData(zidds), copy = TRUE)

# plot PC1 vs PC2
pca$x %>%
  ggplot(aes(x = PC1,
             y = PC2,
             colour = biological, 
             shape = technical)) +
  geom_point()

PCA

# calculate the variation per component
var_explained0 <- pca$sdev^2/sum(pca$sdev^2) #Calculate PC variance
names(var_explained0) <- colnames(pca$x)
var_explained <- as_tibble(var_explained0,
                            rownames = "PC") %>%
mutate(var = paste0((round(var_explained0*100)), ' %', sep = '')) %>%
select(PC, var)

pca$var_explained <- var_explained

# scree splot
pca$var_explained %>%
  slice_head(n=10) %>%
  mutate(PC = factor(PC, levels = c("PC1", "PC2", "PC3", "PC4", "PC5", 
                                    "PC6", "PC7", "PC8", "PC9", "PC10")), 
         var = as.integer(str_extract(var, pattern = "[:digit:]+"))) %>%
  ggplot(aes(x = PC, 
             y = var)) +
  geom_col()

Scree plot

ZINB-WaVE

I have been following the ZINB-WaVE vignette and the deseq2-zinbwave workflow to try to account for zero inflation.

I first want to use ZINB-WaVE's dimensionality reduction method to see how adjusting for dropout influences the low dimension representation. Below is an example of my working.

# ZINB dimensionality reduction with 1000 most variable genes, K = 2, epsilon default

zinb <- zinbwave(zidds, 
                K = 2)

W <- reducedDim(zinb)

as_tibble(W, rownames = "sample_name") %>%
  inner_join(colData(zinb), copy = TRUE) %>%
    ggplot(aes(W1, 
                      W2, 
                      colour=biological, 
                      shape=technical)) + 
    geom_point()

zinbwave


sessionInfo( )

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices datasets 
 [8] utils     methods   base     

other attached packages:
 [1] edgeR_3.32.1                limma_3.46.0               
 [3] zinbwave_1.12.0             SingleCellExperiment_1.12.0
 [5] apeglm_1.12.0               biomaRt_2.46.1             
 [7] BiocParallel_1.24.1         org.Mm.eg.db_3.12.0        
 [9] GenomicFeatures_1.42.1      AnnotationDbi_1.52.0       
[11] rhdf5_2.34.0                tximport_1.18.0            
[13] DESeq2_1.30.0               SummarizedExperiment_1.20.0
[15] Biobase_2.50.0              MatrixGenerics_1.2.0       
[17] matrixStats_0.57.0          GenomicRanges_1.42.0       
[19] GenomeInfoDb_1.26.2         IRanges_2.24.1             
[21] S4Vectors_0.28.1            BiocGenerics_0.36.0        
[23] tximeta_1.8.3               broom_0.7.3                
[25] Rtsne_0.15                  factoextra_1.0.7           
[27] gwasrapidd_0.99.11          ggvenn_0.1.8               
[29] msigdbr_7.2.1               gprofiler2_0.2.0           
[31] scales_1.1.1                ggrepel_0.9.1              
[33] openxlsx_4.2.3              readxl_1.3.1               
[35] kableExtra_1.3.1            ggthemes_4.2.4             
[37] car_3.0-10                  carData_3.0-4              
[39] ggpubr_0.4.0                ggsci_2.9                  
[41] cowplot_1.1.1               pander_0.6.3               
[43] forcats_0.5.0               stringr_1.4.0              
[45] dplyr_1.0.3                 purrr_0.3.4                
[47] readr_1.4.0                 tidyr_1.1.2                
[49] tibble_3.0.5                ggplot2_3.3.3              
[51] tidyverse_1.3.0             devtools_2.3.2             
[53] usethis_2.0.0              

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.1                rtracklayer_1.50.0           
  [3] coda_0.19-4                   bit64_4.0.5                  
  [5] knitr_1.30                    DelayedArray_0.16.1          
  [7] hwriter_1.3.2                 data.table_1.13.6            
  [9] RCurl_1.98-1.2                AnnotationFilter_1.14.0      
 [11] generics_0.1.0                callr_3.5.1                  
 [13] RSQLite_2.2.3                 bit_4.0.4                    
 [15] webshot_0.5.2                 xml2_1.3.2                   
 [17] lubridate_1.7.9.2             httpuv_1.5.5                 
 [19] assertthat_0.2.1              xfun_0.20                    
 [21] hms_1.0.0                     evaluate_0.14                
 [23] promises_1.1.1                fansi_0.4.2                  
 [25] progress_1.2.2                dbplyr_2.0.0                 
 [27] DBI_1.1.1                     geneplotter_1.68.0           
 [29] htmlwidgets_1.5.3             ellipsis_0.3.1               
 [31] backports_1.2.1               FactoMineR_2.4               
 [33] bookdown_0.21                 annotate_1.68.0              
 [35] vctrs_0.3.6                   remotes_2.2.0                
 [37] ensembldb_2.14.0              abind_1.4-5                  
 [39] cachem_1.0.1                  withr_2.4.1                  
 [41] bdsmatrix_1.3-4               GenomicAlignments_1.26.0     
 [43] prettyunits_1.1.1             softImpute_1.4               
 [45] cluster_2.1.0                 lazyeval_0.2.2               
 [47] crayon_1.3.4                  genefilter_1.72.1            
 [49] pkgconfig_2.0.3               labeling_0.4.2               
 [51] pkgload_1.1.0                 ProtGenerics_1.22.0          
 [53] rlang_0.4.10                  lifecycle_0.2.0              
 [55] BiocFileCache_1.14.0          modelr_0.1.8                 
 [57] AnnotationHub_2.22.0          cellranger_1.1.0             
 [59] rprojroot_2.0.2               Matrix_1.3-2                 
 [61] Rhdf5lib_1.12.0               zoo_1.8-8                    
 [63] reprex_1.0.0                  processx_3.4.5               
 [65] png_0.1-7                     viridisLite_0.3.0            
 [67] bitops_1.0-6                  rhdf5filters_1.2.0           
 [69] Biostrings_2.58.0             blob_1.2.1                   
 [71] ShortRead_1.48.0              jpeg_0.1-8.1                 
 [73] rstatix_0.6.0                 ggsignif_0.6.0               
 [75] leaps_3.1                     memoise_2.0.0                
 [77] magrittr_2.0.1                plyr_1.8.6                   
 [79] zlibbioc_1.36.0               compiler_4.0.3               
 [81] tinytex_0.29                  bbmle_1.0.23.1               
 [83] RColorBrewer_1.1-2            Rsamtools_2.6.0              
 [85] cli_2.2.0                     XVector_0.30.0               
 [87] ps_1.5.0                      MASS_7.3-53                  
 [89] tidyselect_1.1.0              stringi_1.5.3                
 [91] highr_0.8                     emdbook_1.3.12               
 [93] yaml_2.2.1                    askpass_1.1                  
 [95] locfit_1.5-9.4                latticeExtra_0.6-29          
 [97] tools_4.0.3                   rio_0.5.16                   
 [99] rstudioapi_0.13               foreign_0.8-81               
[101] scatterplot3d_0.3-41          farver_2.0.3                 
[103] digest_0.6.27                 BiocManager_1.30.10          
[105] shiny_1.6.0                   Rcpp_1.0.6                   
[107] BiocVersion_3.12.0            later_1.1.0.1                
[109] ngsReports_1.6.1              httr_1.4.2                   
[111] ggdendro_0.1.22               colorspace_2.0-0             
[113] rvest_0.3.6                   XML_3.99-0.5                 
[115] fs_1.5.0                      splines_4.0.3                
[117] renv_0.12.5                   plotly_4.9.3                 
[119] sessioninfo_1.1.1             xtable_1.8-4                 
[121] jsonlite_1.7.2                flashClust_1.01-2            
[123] testthat_3.0.1                R6_2.5.0                     
[125] pillar_1.4.7                  htmltools_0.5.1.1            
[127] mime_0.9                      glue_1.4.2                   
[129] fastmap_1.1.0                 DT_0.17                      
[131] interactiveDisplayBase_1.28.0 pkgbuild_1.2.0               
[133] mvtnorm_1.1-1                 lattice_0.20-41              
[135] numDeriv_2016.8-1.1           curl_4.3                     
[137] zip_2.1.1                     openssl_1.4.3                
[139] survival_3.2-7                rmarkdown_2.6                
[141] desc_1.2.0                    munsell_0.5.0                
[143] GenomeInfoDbData_1.2.4        haven_2.3.1                  
[145] reshape2_1.4.4                gtable_0.3.0
DESeq2 ZINB zinbwave deseq • 2.2k views
ADD COMMENT
0
Entering edit mode
davide risso ▴ 980
@davide-risso-5075
Last seen 9 months ago
University of Padova

Hi,

I will respond to the ZINB-WaVE specific questions, but I will let Mike or other DESeq2 experts comment on the dispersion plot, as I'm not sure how "unusual" that look and if it justifies the use of weights. A quick aside: I guess that a missing piece to further comment on that is, how do the dispersion estimates change once you take zero inflation into account with zinbwave weights? Do the weights lead to a better plot?

Now onto the questions:

  • How should I decide on a value for K when running zinbwave()? Will this impact observational weights when using DESeq2?

There is no set-in-stone rule to choose K for zinbwave.What we suggest is to look at the factors and see if they explain interesting variation in the data or not. We do have however a couple of functions that can help you choose K, namely zinbAIC and zinbBIC, which will compute the Akaike and Baysian Information Criteria, respectively. Note that these need to be applied to the output of zinbFit rather than zinbwave. The smaller AIC and BIC the better, you could try several values of K and choose the one that minimizes AIC and/or BIC.

The value of K enters into the weights calculation only through $\mu$, so we generally do not expect that it will influence them much. It is important though to highlight that these are two different usages of zinbwave, namely we usually use the K factors for unsupervised analyses when we do not know a priori the sample groups, while we use the observational weights for differential analysis in supervised problems (where we do know the groups). In the latter case, it is better to directly specify the groups in the X design matrix rather than letting zinbwave estimate the factors in W.

  • Should I filter genes for the dimensionality reduction step?

We usually recommend this because it is both computationally faster and more stable, e.g., many lowly expressed genes may only contribute noise and not much signal. Of course, you want to use all genes when computing the observational weights because you need a vector of weights for each gene that enters the DESeq2 analysis.

  • How should I decide on a value for epsilon?

This is probably the less obvious choice in the zinbwave package. Generally, high values are better than low values as epsilon is used to stabilize the estimates. Your observation is correct, the "optimal" value of epsilon depends on both the number of genes and factors, because as the number of parameters (that depend on the number of genes, samples, covariates, and factors) increases we need a higher value of epsilon.

I hope this is useful to clarify at least some of your questions. Please, let me know if I was unclear and I will try to give more details.

Best, Davide

ADD COMMENT
0
Entering edit mode

Thank you for your quick and detailed response. Really helpful!

  • To answer your first question about how the dispersion estimates change with zero inflation taken into account: It looks like it makes a big difference. The maximum dispersion estimates are around 5, as opposed to 50 earlier:

disp_est_zinb

  • I ran zinbAIC and zinbBIC to help decide on a value for K. AIC keeps falling with increasing K, but BIC reaches a minimum around 3-4, so I've decided to stick to 4 as a compromise. I will need to do some reading about these criteria to better understand them!

aic_bic

  • Thanks for the clarification on gene filtering. Certainly when including all genes for dimensionality reduction, it takes an age and some samples appear to drift away from the majority, even though nothing in the initial QC flagged them as potential outliers.

    • Could I ask a follow-up question on this? My data were collected over a series of batches, and a subset of those batches show some separation when plotting latent factors. Rather than including the batch identifier in my DESeq2 design, would I be better off including the vector of the latent factor that demonstrates this batch difference?
  • For epsilon, I have compared dispersion estimates for the default option, 1e6 and 1e12. I will probably go with 1e6, based on your advice to increase epsilon:

Default:

epsilon_default

1e6:

epsilon_1e6

1e12:

epsilon_1e12

Thank you again!

ADD REPLY
0
Entering edit mode

Yes, I think your choices of K and epsilon make sense based on the plots. And it seems that the weights are indeed beneficial!

As for your follow-up question, my guess is that including the batch variable directly in the model should be preferable since you have that information. I would use the latent factors only in case the batch IDs were not available or somewhat not trustworthy.

ADD REPLY
0
Entering edit mode

Interesting, and good to know, thank you.

ADD REPLY
0
Entering edit mode

Agree with Davide's assessment above.

ADD REPLY

Login before adding your answer.

Traffic: 643 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6