P-values change 100-fold with only a single count changed
1
0
Entering edit mode
Andrey • 0
@824e735b
Last seen 2.2 years ago
Russia

Hi Michael and everyone,

We did a simple tutorial for students and detected a strange behavior. Changing just a single count in a table of 1000 genes changes adjusted P-values from ~0.98 to more expected in range of 10^-6 ~ 1:

Line in counts_bad.tsv: ENSDARG00000089346 0 0 2 0

Top 5 adjusted P-values: 0.9804526 0.9804526 0.9804526 0.9804526 0.9804526

Same line in counts_ok.tsv ENSDARG00000089346 0 0 1 0

Top 5 adjusted P-values: 1.128419e-06 1.128419e-06 2.550727e-06 3.131545e-06 1.545015e-05

Code and session info are provided below, input files can be found here https://drive.google.com/file/d/1qCYzSECMmWkyJx-EvkQzOmS406DPkILe/view?usp=sharing https://drive.google.com/file/d/1yGEZmZ9UH9djPBPXw9YGsno-rxY7Fh8z/view?usp=sharing

Is it normal behaviour or am I doing something completely wrong? Thanks a lot for your time!

Best

Andrey


#url <- "https://bioconductor.org/packages/release/bioc/src/contrib/DESeq2_1.34.0.tar.gz"
#install.packages(url, repos=NULL, type="source")
library(DESeq2)
packageVersion("DESeq2")

# Set up the conditions based on the experimental setup.
cond_1 = rep("cond1", 2)
cond_2 = rep("cond2", 2)
# Read the data from the standard input.
countData = read.table("counts_bad.tsv", header=TRUE, sep="\t", row.names=1 )

# Build the dataframe from the conditions
samples = names(countData)
condition = factor(c(cond_1, cond_2))
colData = data.frame(samples=samples, condition=condition)

# Create DESEq2 dataset.
dds = DESeqDataSetFromMatrix(countData=countData, colData=colData, design = ~condition)
#Set the reference to be compared
dds$condition = relevel(dds$condition,"cond1")
# Run deseq
dds = DESeq(dds, fitType = 'mean')
# Format the results.
res = results(dds)
row.names(res)[1:5]
res$padj[1:5]
# Sort the results data frame by the padj and foldChange columns.
sorted = res[with(res, order(padj, -log2FoldChange)), ]
row.names(sorted)[1:5]
sorted$padj[1:5]
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=ru_RU.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=ru_RU.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=ru_RU.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.34.0               SummarizedExperiment_1.16.1 DelayedArray_0.12.3         BiocParallel_1.20.1        
 [5] matrixStats_0.61.0          Biobase_2.46.0              GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
 [9] IRanges_2.20.2              S4Vectors_0.24.4            BiocGenerics_0.32.0        

loaded via a namespace (and not attached):
 [1] genefilter_1.68.0      locfit_1.5-9.4         splines_3.6.3          lattice_0.20-40        colorspace_2.0-2      
 [6] vctrs_0.3.8            utf8_1.2.2             blob_1.2.2             survival_3.1-8         XML_3.99-0.3          
[11] rlang_0.4.12           pillar_1.6.4           glue_1.5.1             DBI_1.1.1              bit64_4.0.5           
[16] RColorBrewer_1.1-2     GenomeInfoDbData_1.2.2 lifecycle_1.0.1        zlibbioc_1.32.0        munsell_0.5.0         
[21] gtable_0.3.0           memoise_2.0.1          geneplotter_1.64.0     fastmap_1.1.0          AnnotationDbi_1.48.0  
[26] fansi_0.5.0            Rcpp_1.0.7             xtable_1.8-4           scales_1.1.1           BiocManager_1.30.16   
[31] cachem_1.0.6           annotate_1.64.0        XVector_0.26.0         bit_4.0.4              ggplot2_3.3.5         
[36] grid_3.6.3             tools_3.6.3            bitops_1.0-7           magrittr_2.0.1         RCurl_1.98-1.5        
[41] RSQLite_2.2.9          tibble_3.1.6           pkgconfig_2.0.3        crayon_1.4.2           ellipsis_0.3.2        
[46] Matrix_1.2-18          R6_2.5.1               compiler_3.6.3
DESeq2 • 769 views
ADD COMMENT
3
Entering edit mode
@mikelove
Last seen 8 hours ago
United States

I think this dataset is too small (for its sparsity) to run DESeq2 on.

!> table(rowSums(countData > 0))                                                                                              

   0   1   2   3   4                                                                                                          
 471  36 140  35 327

You can see that it's not able to estimate dispersion values reliably if you check plotDispEsts for either dataset. I see only ~140 features with a count of 10 or more for more than 2 samples. The majority of the features then have 2 or more zeros and are less useful for estimating size factors and dispersion which the p-values depend on. While you can run DESeq2 on datasets with low hundreds of features, it won't be sensible if the dispersion estimation only has ~1 or 2 samples out of 4 with a non-zero count to work with.

ADD COMMENT
0
Entering edit mode

Dear Michael,

Thanks a lot for looking at this data! Totally makes sense.

Yes, this is a small dataset we use to teach students, hence the small numbers arise. We'll probably make it a bit bigger.

Best Andrey

ADD REPLY

Login before adding your answer.

Traffic: 489 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6