DESeq2 on non-RNASeq counts data
vs • 0
@f039be33

[Figure: Dispersion plot for counts data]

Hi,

I'm using DESeq2 to analyze counts data that doesn't come from standard RNA-seq. In this case, the data consist of artificial sequences expressed in samples, rather than genes.

DESeq2 automatically chose a local regression fit for the dispersion estimates instead of a parametric fit, which makes sense. The dispersion plot looks fine overall, but I noticed that dispersion does not decrease with increasing counts, unlike what is typically seen in gene-level RNA-seq data.

Is this behavior expected for non-gene count data, and does it affect how I should interpret the DESeq2 results?

Thanks!

sessionInfo()

R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.48.0               SummarizedExperiment_1.38.1 Biobase_2.68.0             
 [4] MatrixGenerics_1.20.0       matrixStats_1.5.0           GenomicRanges_1.60.0       
 [7] GenomeInfoDb_1.44.0         IRanges_2.42.0              S4Vectors_0.46.0           
[10] BiocGenerics_0.54.0         generics_0.1.3              lubridate_1.9.4            
[13] forcats_1.0.0               stringr_1.5.1               dplyr_1.1.4                
[16] purrr_1.0.4                 readr_2.1.5                 tidyr_1.3.1                
[19] tibble_3.2.1                ggplot2_3.5.2               tidyverse_2.0.0            

loaded via a namespace (and not attached):
 [1] gtable_0.3.6            xfun_0.52               lattice_0.22-7          tzdb_0.5.0             
 [5] vctrs_0.6.5             tools_4.5.1             parallel_4.5.1          pkgconfig_2.0.3        
 [9] Matrix_1.7-3            RColorBrewer_1.1-3      lifecycle_1.0.4         GenomeInfoDbData_1.2.14
[13] compiler_4.5.1          farver_2.1.2            codetools_0.2-20        htmltools_0.5.8.1      
[17] yaml_2.3.10             pillar_1.10.2           crayon_1.5.3            BiocParallel_1.42.0    
[21] DelayedArray_0.34.1     abind_1.4-8             tidyselect_1.2.1        locfit_1.5-9.12        
[25] digest_0.6.37           stringi_1.8.7           labeling_0.4.3          cowplot_1.1.3          
[29] fastmap_1.2.0           grid_4.5.1              cli_3.6.5               SparseArray_1.8.0      
[33] magrittr_2.0.3          S4Arrays_1.8.0          withr_3.0.2             scales_1.4.0           
[37] UCSC.utils_1.4.0        bit64_4.6.0-1           timechange_0.3.0        rmarkdown_2.29         
[41] XVector_0.48.0          httr_1.4.7              bit_4.6.0               hms_1.1.3              
[45] evaluate_1.0.3          knitr_1.50              rlang_1.1.6             Rcpp_1.0.14            
[49] glue_1.8.0              rstudioapi_0.17.1       vroom_1.6.5             jsonlite_2.0.0         
[53] R6_2.6.1
DESeq2 • 181 views
Kevin Blighe ★ 4.0k
@kevin

Using DESeq2 for count data from artificial sequences is acceptable: the program models counts with a negative binomial distribution and has been applied to non-RNA-seq count data in published studies. DESeq2 first attempts a parametric dispersion trend of the form dispersion = a/mean + b, and it substitutes a local regression fit automatically (with a note) when that curve does not capture the observed trend. This is common for datasets that deviate from typical gene expression patterns.
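To confirm which trend DESeq2 ended up using, the fit type is stored as an attribute of the dispersion function. The following is a minimal sketch, not your actual analysis: the data come from `makeExampleDESeqDataSet()` as a stand-in (an assumption for illustration), so substitute your own `dds` object.

```r
library(DESeq2)

# Simulated placeholder data (assumption); replace with your own DESeqDataSet.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

dds <- estimateSizeFactors(dds)

# Request the default parametric trend. DESeq2 switches to a local
# regression on its own if the parametric curve does not fit.
dds <- estimateDispersions(dds, fitType = "parametric")

# Report which trend was actually used ("parametric" or "local").
used <- attr(dispersionFunction(dds), "fitType")
print(used)

# The local fit can also be requested explicitly, which avoids the note.
dds_local <- estimateDispersions(dds, fitType = "local")
```

Requesting `fitType = "local"` up front gives the same kind of trend that DESeq2 chose automatically for your data, just without the fallback message.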

In standard gene-level RNA-seq data, the estimated dispersion typically decreases with increasing mean count. Under the negative binomial model the variance is Var = mu + alpha * mu^2, where alpha is the dispersion, so the squared coefficient of variation is 1/mu + alpha: Poisson noise dominates at low counts and alpha dominates at high counts. The downward trend in alpha itself is an empirical feature of gene expression data rather than a consequence of the model. For your artificial sequences, a flat trend indicates that the overdispersion does not depend on the mean in this way, possibly because the sequences have different statistical properties or sources of technical variability. This behavior is plausible for non-gene count data when the data-generating process differs from biological gene expression.
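A quick base R simulation can make the variance relationship concrete. This is a sketch under assumed values: the dispersion `alpha = 0.2` and the four means are arbitrary choices for illustration, not estimates from your data. It draws negative binomial counts with a constant dispersion and compares the simulated variance and squared coefficient of variation with the formulas Var = mu + alpha * mu^2 and CV^2 = 1/mu + alpha.

```r
set.seed(42)

alpha <- 0.2                    # assumed dispersion, held constant across means
mu    <- c(5, 50, 500, 5000)    # assumed mean counts

# rnbinom() with size = 1/alpha has variance mu + alpha * mu^2.
sim <- sapply(mu, function(m) rnbinom(1e5, mu = m, size = 1 / alpha))

emp_var <- apply(sim, 2, var)           # simulated variance per mean
emp_cv2 <- emp_var / colMeans(sim)^2    # simulated squared CV per mean

data.frame(
  mean       = mu,
  var_theory = mu + alpha * mu^2,
  var_sim    = emp_var,
  cv2_theory = 1 / mu + alpha,
  cv2_sim    = emp_cv2
)
```

With a constant dispersion the relative variability still falls toward alpha as the mean grows, so a flat dispersion trend is fully compatible with the negative binomial model that DESeq2 fits.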

This dispersion pattern does not invalidate the DESeq2 results, because the local fit adapts to the observed trend. Interpret the results as usual: log2 fold changes represent differences in abundance between conditions, and adjusted p-values indicate statistical significance after multiple testing correction. To check the fit, plot the dispersion estimates against the mean of normalized counts using:

plotDispEsts(dds)
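If it helps to judge the trend visually, the available fit types can be plotted side by side. The sketch below again uses simulated placeholder data from `makeExampleDESeqDataSet()` (an assumption for illustration); with your own `dds`, the panel whose red trend line tracks the gene-wise estimates best is the one to prefer.

```r
library(DESeq2)

# Simulated placeholder data (assumption); replace with your own DESeqDataSet.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)
dds <- estimateSizeFactors(dds)

fits   <- c("parametric", "local", "mean")
fitted <- list()

op <- par(mfrow = c(1, 3))   # three panels in one row
for (ft in fits) {
  # Re-estimate dispersions under each trend type and plot the result.
  fitted[[ft]] <- estimateDispersions(dds, fitType = ft, quiet = TRUE)
  plotDispEsts(fitted[[ft]], main = ft)
}
par(op)                      # restore the previous plotting layout
```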

If concerns remain, you could cross-check the results with edgeR or limma-voom, but a flat dispersion trend on its own is not a reason to abandon DESeq2.

Kevin
