P-values change 100-fold with only a single count changed
Entering edit mode
Andrey • 0
Last seen 4 months ago

Hi Michael and everyone,

We did a simple tutorial for students and detected a strange behavior. Changing just a single count in a table of 1000 genes changes adjusted P-values from ~0.98 to more expected in range of 10^-6 ~ 1:

Line in counts_bad.tsv: ENSDARG00000089346 0 0 2 0

Top 5 adjusted P-values: 0.9804526 0.9804526 0.9804526 0.9804526 0.9804526

Same line in counts_ok.tsv ENSDARG00000089346 0 0 1 0

Top 5 adjusted P-values: 1.128419e-06 1.128419e-06 2.550727e-06 3.131545e-06 1.545015e-05

Code and session info are provided below, input files can be found here https://drive.google.com/file/d/1qCYzSECMmWkyJx-EvkQzOmS406DPkILe/view?usp=sharing https://drive.google.com/file/d/1yGEZmZ9UH9djPBPXw9YGsno-rxY7Fh8z/view?usp=sharing

Is it normal behaviour or am I doing something completely wrong? Thanks a lot for your time!



#url <- "https://bioconductor.org/packages/release/bioc/src/contrib/DESeq2_1.34.0.tar.gz"
#install.packages(url, repos=NULL, type="source")

# Set up the conditions based on the experimental setup.
cond_1 = rep("cond1", 2)
cond_2 = rep("cond2", 2)
# Read the data from the standard input.
countData = read.table("counts_bad.tsv", header=TRUE, sep="\t", row.names=1 )

# Build the dataframe from the conditions
samples = names(countData)
condition = factor(c(cond_1, cond_2))
colData = data.frame(samples=samples, condition=condition)

# Create DESEq2 dataset.
dds = DESeqDataSetFromMatrix(countData=countData, colData=colData, design = ~condition)
#Set the reference to be compared
dds$condition = relevel(dds$condition,"cond1")
# Run deseq
dds = DESeq(dds, fitType = 'mean')
# Format the results.
res = results(dds)
# Sort the results data frame by the padj and foldChange columns.
sorted = res[with(res, order(padj, -log2FoldChange)), ]
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=ru_RU.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=ru_RU.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=ru_RU.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.34.0               SummarizedExperiment_1.16.1 DelayedArray_0.12.3         BiocParallel_1.20.1        
 [5] matrixStats_0.61.0          Biobase_2.46.0              GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
 [9] IRanges_2.20.2              S4Vectors_0.24.4            BiocGenerics_0.32.0        

loaded via a namespace (and not attached):
 [1] genefilter_1.68.0      locfit_1.5-9.4         splines_3.6.3          lattice_0.20-40        colorspace_2.0-2      
 [6] vctrs_0.3.8            utf8_1.2.2             blob_1.2.2             survival_3.1-8         XML_3.99-0.3          
[11] rlang_0.4.12           pillar_1.6.4           glue_1.5.1             DBI_1.1.1              bit64_4.0.5           
[16] RColorBrewer_1.1-2     GenomeInfoDbData_1.2.2 lifecycle_1.0.1        zlibbioc_1.32.0        munsell_0.5.0         
[21] gtable_0.3.0           memoise_2.0.1          geneplotter_1.64.0     fastmap_1.1.0          AnnotationDbi_1.48.0  
[26] fansi_0.5.0            Rcpp_1.0.7             xtable_1.8-4           scales_1.1.1           BiocManager_1.30.16   
[31] cachem_1.0.6           annotate_1.64.0        XVector_0.26.0         bit_4.0.4              ggplot2_3.3.5         
[36] grid_3.6.3             tools_3.6.3            bitops_1.0-7           magrittr_2.0.1         RCurl_1.98-1.5        
[41] RSQLite_2.2.9          tibble_3.1.6           pkgconfig_2.0.3        crayon_1.4.2           ellipsis_0.3.2        
[46] Matrix_1.2-18          R6_2.5.1               compiler_3.6.3
DESeq2 • 309 views
Entering edit mode
Last seen 1 day ago
United States

I think this dataset is too small (for its sparsity) to run DESeq2 on.

!> table(rowSums(countData > 0))                                                                                              

   0   1   2   3   4                                                                                                          
 471  36 140  35 327

You can see that it's not able to estimate dispersion values reliably if you check plotDispEsts for either dataset. I see only ~140 features with a count of 10 or more for more than 2 samples. The majority of the features then have 2 or more zeros and are less useful for estimating size factors and dispersion which the p-values depend on. While you can run DESeq2 on datasets with low hundreds of features, it won't be sensible if the dispersion estimation only has ~1 or 2 samples out of 4 with a non-zero count to work with.

Entering edit mode

Dear Michael,

Thanks a lot for looking at this data! Totally makes sense.

Yes, this is a small dataset we use to teach students, hence the small numbers arise. We'll probably make it a bit bigger.

Best Andrey


Login before adding your answer.

Traffic: 269 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6