Question

An attempt to better understand DESeq2 dispersion estimation

1

Entering edit mode

Sam ▴ 10

@sam-21502

Last seen 1 day ago

Jerusalem

Why is the dispersion high for data which yields the following PCA? enter image description here

As far as I have understood DESeq2 vignette If I have multiple groups, should I run all together or split into pairs of groups? the main thing which influences the dispersion is within-group variation.

However, for this PCA the dispersion is very high but it is only the between-group variation that is (very) high, not within group variation.

DESeq2 • 1.6k views

ADD COMMENT • link 3.1 years ago Sam ▴ 10

score 0 · Answer 1 · 2021-03-05

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 13 hours ago

United States

I would run these all together. The within-group variation does not appear like the example in the vignette.

ADD COMMENT • link 3.1 years ago Michael Love 41k

0

Entering edit mode

Is my understanding correct that dispersion should mainly be influenced by within-group-variation (as in the example in the vignette) and not by between-group-variation? This is what the vignette (and the answer above) seems to say. However, when I look at some gene values & dispersion in this example, I get very high levels of dispersion, and they seem to be caused by between-group-variation.

as.data.frame(mcols(dds))["ENSMUSG00000021961",]


                          Symbol baseMean  baseVar allZero dispGeneEst dispGeneIter  dispFit dispersion dispIter
ENSMUSG00000021961 4930578I06Rik 435.7912 219388.8   FALSE    10.54408           10 1.406939   8.938508       10
                   dispOutlier  dispMAP
ENSMUSG00000021961       FALSE 8.938508

normc["ENSMUSG00000021961",]
                   A A A AB AB AB        Q        Q        Q       QB       QB      QB               name
ENSMUSG00000021961 0 0 0  0  0  0 977.3829 795.7892 794.6793 1156.702 722.5705 782.371 ENSMUSG00000021961


sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_IL.UTF-8       LC_NUMERIC=C               LC_TIME=en_IL.UTF-8        LC_COLLATE=en_IL.UTF-8    
 [5] LC_MONETARY=en_IL.UTF-8    LC_MESSAGES=en_IL.UTF-8    LC_PAPER=en_IL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_IL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggrepel_0.9.1               stringr_1.4.0               RColorBrewer_1.1-2          pheatmap_1.0.12            
 [5] ggplot2_3.3.3               DESeq2_1.30.1               SummarizedExperiment_1.20.0 Biobase_2.50.0             
 [9] MatrixGenerics_1.2.1        matrixStats_0.58.0          GenomicRanges_1.42.0        GenomeInfoDb_1.26.2        
[13] IRanges_2.24.1              S4Vectors_0.28.1            BiocGenerics_0.36.0        

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.4         Rcpp_1.0.6             lattice_0.20-41        digest_0.6.27         
 [5] assertthat_0.2.1       utf8_1.1.4             R6_2.5.0               RSQLite_2.2.3         
 [9] httr_1.4.2             pillar_1.5.0           zlibbioc_1.36.0        rlang_0.4.10          
[13] annotate_1.68.0        blob_1.2.1             Matrix_1.3-2           labeling_0.4.2        
[17] splines_4.0.4          BiocParallel_1.24.1    geneplotter_1.68.0     RCurl_1.98-1.2        
[21] bit_4.0.4              munsell_0.5.0          tinytex_0.29           DelayedArray_0.16.1   
[25] compiler_4.0.4         xfun_0.21              pkgconfig_2.0.3        tidyselect_1.1.0      
[29] tibble_3.0.6           GenomeInfoDbData_1.2.4 XML_3.99-0.5           fansi_0.4.2           
[33] withr_2.4.1            crayon_1.4.1           dplyr_1.0.4            bitops_1.0-6          
[37] xtable_1.8-4           gtable_0.3.0           lifecycle_1.0.0        DBI_1.1.1             
[41] magrittr_2.0.1         scales_1.1.1           stringi_1.5.3          cachem_1.0.4          
[45] farver_2.0.3           XVector_0.30.0         genefilter_1.72.1      ellipsis_0.3.1        
[49] vctrs_0.3.6            generics_0.1.0         tools_4.0.4            bit64_4.0.5           
[53] glue_1.4.2             purrr_0.3.4            fastmap_1.1.0          survival_3.2-7        
[57] AnnotationDbi_1.52.0   colorspace_2.0-0       memoise_2.0.0

ADD REPLY • link 3.1 years ago Sam ▴ 10

0

Entering edit mode

Dispersion depends on the design, can you show design(dds) and colData(dds)?

ADD REPLY • link 3.1 years ago Michael Love 41k

0

Entering edit mode

You are right, I have chosen design=~1. This explains the dispersion plot.

I understand this to be a mistake. Now it seems to me that even for QC figures, one should use the design one intends to use for DE. One should simply set blind=TRUE as an rlog parameter (for the sake of QC).

ADD REPLY • link 3.1 years ago Sam ▴ 10