An attempt to better understand DESeq2 dispersion estimation
1
1
Entering edit mode
Sam ▴ 10
@sam-21502
Last seen 1 day ago
Jerusalem

Why is the dispersion high for data which yields the following PCA? enter image description here

As far as I have understood DESeq2 vignette If I have multiple groups, should I run all together or split into pairs of groups? the main thing which influences the dispersion is within-group variation.

However, for this PCA the dispersion is very high but it is only the between-group variation that is (very) high, not within group variation.

DESeq2 • 1.6k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 13 hours ago
United States

I would run these all together. The within-group variation does not appear like the example in the vignette.

ADD COMMENT
0
Entering edit mode

Is my understanding correct that dispersion should mainly be influenced by within-group-variation (as in the example in the vignette) and not by between-group-variation? This is what the vignette (and the answer above) seems to say. However, when I look at some gene values & dispersion in this example, I get very high levels of dispersion, and they seem to be caused by between-group-variation.

as.data.frame(mcols(dds))["ENSMUSG00000021961",]


                          Symbol baseMean  baseVar allZero dispGeneEst dispGeneIter  dispFit dispersion dispIter
ENSMUSG00000021961 4930578I06Rik 435.7912 219388.8   FALSE    10.54408           10 1.406939   8.938508       10
                   dispOutlier  dispMAP
ENSMUSG00000021961       FALSE 8.938508

normc["ENSMUSG00000021961",]
                   A A A AB AB AB        Q        Q        Q       QB       QB      QB               name
ENSMUSG00000021961 0 0 0  0  0  0 977.3829 795.7892 794.6793 1156.702 722.5705 782.371 ENSMUSG00000021961


sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_IL.UTF-8       LC_NUMERIC=C               LC_TIME=en_IL.UTF-8        LC_COLLATE=en_IL.UTF-8    
 [5] LC_MONETARY=en_IL.UTF-8    LC_MESSAGES=en_IL.UTF-8    LC_PAPER=en_IL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_IL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggrepel_0.9.1               stringr_1.4.0               RColorBrewer_1.1-2          pheatmap_1.0.12            
 [5] ggplot2_3.3.3               DESeq2_1.30.1               SummarizedExperiment_1.20.0 Biobase_2.50.0             
 [9] MatrixGenerics_1.2.1        matrixStats_0.58.0          GenomicRanges_1.42.0        GenomeInfoDb_1.26.2        
[13] IRanges_2.24.1              S4Vectors_0.28.1            BiocGenerics_0.36.0        

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.4         Rcpp_1.0.6             lattice_0.20-41        digest_0.6.27         
 [5] assertthat_0.2.1       utf8_1.1.4             R6_2.5.0               RSQLite_2.2.3         
 [9] httr_1.4.2             pillar_1.5.0           zlibbioc_1.36.0        rlang_0.4.10          
[13] annotate_1.68.0        blob_1.2.1             Matrix_1.3-2           labeling_0.4.2        
[17] splines_4.0.4          BiocParallel_1.24.1    geneplotter_1.68.0     RCurl_1.98-1.2        
[21] bit_4.0.4              munsell_0.5.0          tinytex_0.29           DelayedArray_0.16.1   
[25] compiler_4.0.4         xfun_0.21              pkgconfig_2.0.3        tidyselect_1.1.0      
[29] tibble_3.0.6           GenomeInfoDbData_1.2.4 XML_3.99-0.5           fansi_0.4.2           
[33] withr_2.4.1            crayon_1.4.1           dplyr_1.0.4            bitops_1.0-6          
[37] xtable_1.8-4           gtable_0.3.0           lifecycle_1.0.0        DBI_1.1.1             
[41] magrittr_2.0.1         scales_1.1.1           stringi_1.5.3          cachem_1.0.4          
[45] farver_2.0.3           XVector_0.30.0         genefilter_1.72.1      ellipsis_0.3.1        
[49] vctrs_0.3.6            generics_0.1.0         tools_4.0.4            bit64_4.0.5           
[53] glue_1.4.2             purrr_0.3.4            fastmap_1.1.0          survival_3.2-7        
[57] AnnotationDbi_1.52.0   colorspace_2.0-0       memoise_2.0.0 
ADD REPLY
0
Entering edit mode

Dispersion depends on the design, can you show design(dds) and colData(dds)?

ADD REPLY
0
Entering edit mode

You are right, I have chosen design=~1. This explains the dispersion plot.

I understand this to be a mistake. Now it seems to me that even for QC figures, one should use the design one intends to use for DE. One should simply set blind=TRUE as an rlog parameter (for the sake of QC).

ADD REPLY

Login before adding your answer.

Traffic: 543 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6