Question

DESeq2 on non-RNASeq counts data

[Image: dispersion plot for counts data]

Hi,

I'm using DESeq2 to analyze counts data that doesn't come from standard RNA-seq. In this case, the data consist of artificial sequences expressed in samples, rather than genes.

DESeq2 automatically chose a local regression fit for the dispersion estimates instead of a parametric fit, which makes sense. The dispersion plot looks fine overall, but I noticed that dispersion does not decrease with increasing counts, unlike what is typically seen in gene-level RNA-seq data.

Is this behavior expected for non-gene count data, and does it affect how I should interpret the DESeq2 results?

Thanks!

sessionInfo()

R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.48.0               SummarizedExperiment_1.38.1 Biobase_2.68.0             
 [4] MatrixGenerics_1.20.0       matrixStats_1.5.0           GenomicRanges_1.60.0       
 [7] GenomeInfoDb_1.44.0         IRanges_2.42.0              S4Vectors_0.46.0           
[10] BiocGenerics_0.54.0         generics_0.1.3              lubridate_1.9.4            
[13] forcats_1.0.0               stringr_1.5.1               dplyr_1.1.4                
[16] purrr_1.0.4                 readr_2.1.5                 tidyr_1.3.1                
[19] tibble_3.2.1                ggplot2_3.5.2               tidyverse_2.0.0            

loaded via a namespace (and not attached):
 [1] gtable_0.3.6            xfun_0.52               lattice_0.22-7          tzdb_0.5.0             
 [5] vctrs_0.6.5             tools_4.5.1             parallel_4.5.1          pkgconfig_2.0.3        
 [9] Matrix_1.7-3            RColorBrewer_1.1-3      lifecycle_1.0.4         GenomeInfoDbData_1.2.14
[13] compiler_4.5.1          farver_2.1.2            codetools_0.2-20        htmltools_0.5.8.1      
[17] yaml_2.3.10             pillar_1.10.2           crayon_1.5.3            BiocParallel_1.42.0    
[21] DelayedArray_0.34.1     abind_1.4-8             tidyselect_1.2.1        locfit_1.5-9.12        
[25] digest_0.6.37           stringi_1.8.7           labeling_0.4.3          cowplot_1.1.3          
[29] fastmap_1.2.0           grid_4.5.1              cli_3.6.5               SparseArray_1.8.0      
[33] magrittr_2.0.3          S4Arrays_1.8.0          withr_3.0.2             scales_1.4.0           
[37] UCSC.utils_1.4.0        bit64_4.6.0-1           timechange_0.3.0        rmarkdown_2.29         
[41] XVector_0.48.0          httr_1.4.7              bit_4.6.0               hms_1.1.3              
[45] evaluate_1.0.3          knitr_1.50              rlang_1.1.6             Rcpp_1.0.14            
[49] glue_1.8.0              rstudioapi_0.17.1       vroom_1.6.5             jsonlite_2.0.0         
[53] R6_2.6.1

DESeq2

written 5 months ago by vs

score 0 · Answer 1 · 2025-11-20

DESeq2 is a reasonable choice for count data from artificial sequences: it models counts with a negative binomial distribution and has been applied to non-RNA-seq count data in published studies. DESeq2 substitutes a local regression fit for the dispersion trend when the default parametric fit fails or does not capture the trend, and it prints a note when it does so. This is common for datasets that deviate from typical gene expression patterns.

In gene-level RNA-seq data, the fitted dispersion usually decreases with increasing mean count and levels off at high counts. The negative binomial model already separates Poisson noise from overdispersion (variance = mean + dispersion × mean^2), so the dispersion is the coefficient on the squared-mean term, and the downward trend is an empirical property of gene expression data rather than something the model forces. DESeq2's parametric trend, dispersion = asymptDisp + extraPois / mean, encodes that shape. For your artificial sequences, a trend that does not decrease means the overdispersion is roughly constant, or not mean-dependent in the usual way, across the count range, possibly because the sequences have different statistical properties or sources of technical variability. That is plausible for non-gene count data whose generating process differs from biological gene expression.
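As an illustration of what a flat dispersion trend looks like under the negative binomial model, here is a minimal base-R sketch on simulated data (not your experiment). The dispersion value `alpha`, the feature means in `mu`, and the helper `mom_dispersion` are all assumptions made up for the example, and the method-of-moments estimate is only a rough stand-in for what DESeq2 computes internally.

```r
set.seed(1)

alpha <- 0.2                       # assumed true dispersion, shared by all features
mu    <- c(10, 100, 1000, 10000)   # assumed feature means spanning several orders of magnitude
n     <- 5000                      # many replicates so the moment estimates are stable

# Negative binomial with size = 1/alpha has variance mu + alpha * mu^2
counts <- sapply(mu, function(m) rnbinom(n, mu = m, size = 1 / alpha))

# Hypothetical helper: method-of-moments dispersion, (variance - mean) / mean^2
mom_dispersion <- function(x) (var(x) - mean(x)) / mean(x)^2
est <- apply(counts, 2, mom_dispersion)

# Squared coefficient of variation, which equals 1/mu + alpha in expectation
cv2 <- apply(counts, 2, function(x) var(x) / mean(x)^2)

print(round(rbind(mean = mu, dispersion = est, cv2 = cv2), 3))
```

In this setup the estimated dispersion should stay close to `alpha` at every mean, while the squared coefficient of variation still falls with the mean because of the Poisson term. A flat dispersion trend is therefore consistent with the model; it just differs from the shape usually seen for genes.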

This dispersion pattern does not by itself invalidate the DESeq2 results, because the local fit adapts to the observed trend. Interpret the results as usual: log2 fold changes estimate differences in abundance between conditions, and adjusted p-values indicate statistical significance after multiple testing correction. To check the fit, plot the per-sequence dispersion estimates, the fitted trend, and the final shrunken estimates against the mean of normalized counts:

plotDispEsts(dds)
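If it helps to see how the trend choice changes the picture, the following is a hedged sketch that fits the dispersion trend both ways and plots the results side by side. It uses `makeExampleDESeqDataSet` purely as a simulated stand-in for your own object, so the numbers of features and samples are arbitrary assumptions; swap in your own `dds` to make the comparison meaningful.

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in; replace with your own DESeqDataSet
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)
dds <- estimateSizeFactors(dds)

# Fit the dispersion trend two ways so the curves can be compared
dds_par   <- estimateDispersions(dds, fitType = "parametric")
dds_local <- estimateDispersions(dds, fitType = "local")

# The fit type that was actually used is stored on the dispersion function;
# a parametric request can come back as "local" if the parametric fit failed
fit_par   <- attr(dispersionFunction(dds_par), "fitType")
fit_local <- attr(dispersionFunction(dds_local), "fitType")
print(c(requested_parametric = fit_par, requested_local = fit_local))

# Left panel: parametric request; right panel: local regression
par(mfrow = c(1, 2))
plotDispEsts(dds_par)
plotDispEsts(dds_local)
```

On your data, the things to look for are whether the fitted trend passes through the bulk of the per-sequence estimates and whether the final estimates are shrunk toward it. If the trend is essentially flat, `fitType = "mean"` is another documented option for `estimateDispersions`.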

If concerns remain, you could compare the results against alternative methods such as edgeR or limma-voom, but nothing described here suggests that DESeq2 is inappropriate for these data.

Kevin