Hi Michael and everyone,
We did a simple tutorial for students and detected a strange behavior. Changing just a single count in a table of 1000 genes changes adjusted P-values from ~0.98 to more expected in range of 10^-6 ~ 1:
Line in counts_bad.tsv: ENSDARG00000089346 0 0 2 0
Top 5 adjusted P-values: 0.9804526 0.9804526 0.9804526 0.9804526 0.9804526
Same line in counts_ok.tsv ENSDARG00000089346 0 0 1 0
Top 5 adjusted P-values: 1.128419e-06 1.128419e-06 2.550727e-06 3.131545e-06 1.545015e-05
Code and session info are provided below, input files can be found here https://drive.google.com/file/d/1qCYzSECMmWkyJx-EvkQzOmS406DPkILe/view?usp=sharing https://drive.google.com/file/d/1yGEZmZ9UH9djPBPXw9YGsno-rxY7Fh8z/view?usp=sharing
Is it normal behaviour or am I doing something completely wrong? Thanks a lot for your time!
Best
Andrey
#url <- "https://bioconductor.org/packages/release/bioc/src/contrib/DESeq2_1.34.0.tar.gz"
#install.packages(url, repos=NULL, type="source")
library(DESeq2)
packageVersion("DESeq2")
# Set up the conditions based on the experimental setup.
cond_1 = rep("cond1", 2)
cond_2 = rep("cond2", 2)
# Read the data from the standard input.
countData = read.table("counts_bad.tsv", header=TRUE, sep="\t", row.names=1 )
# Build the dataframe from the conditions
samples = names(countData)
condition = factor(c(cond_1, cond_2))
colData = data.frame(samples=samples, condition=condition)
# Create DESEq2 dataset.
dds = DESeqDataSetFromMatrix(countData=countData, colData=colData, design = ~condition)
#Set the reference to be compared
dds$condition = relevel(dds$condition,"cond1")
# Run deseq
dds = DESeq(dds, fitType = 'mean')
# Format the results.
res = results(dds)
row.names(res)[1:5]
res$padj[1:5]
# Sort the results data frame by the padj and foldChange columns.
sorted = res[with(res, order(padj, -log2FoldChange)), ]
row.names(sorted)[1:5]
sorted$padj[1:5]
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=ru_RU.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=ru_RU.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=ru_RU.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.34.0 SummarizedExperiment_1.16.1 DelayedArray_0.12.3 BiocParallel_1.20.1
[5] matrixStats_0.61.0 Biobase_2.46.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.1
[9] IRanges_2.20.2 S4Vectors_0.24.4 BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] genefilter_1.68.0 locfit_1.5-9.4 splines_3.6.3 lattice_0.20-40 colorspace_2.0-2
[6] vctrs_0.3.8 utf8_1.2.2 blob_1.2.2 survival_3.1-8 XML_3.99-0.3
[11] rlang_0.4.12 pillar_1.6.4 glue_1.5.1 DBI_1.1.1 bit64_4.0.5
[16] RColorBrewer_1.1-2 GenomeInfoDbData_1.2.2 lifecycle_1.0.1 zlibbioc_1.32.0 munsell_0.5.0
[21] gtable_0.3.0 memoise_2.0.1 geneplotter_1.64.0 fastmap_1.1.0 AnnotationDbi_1.48.0
[26] fansi_0.5.0 Rcpp_1.0.7 xtable_1.8-4 scales_1.1.1 BiocManager_1.30.16
[31] cachem_1.0.6 annotate_1.64.0 XVector_0.26.0 bit_4.0.4 ggplot2_3.3.5
[36] grid_3.6.3 tools_3.6.3 bitops_1.0-7 magrittr_2.0.1 RCurl_1.98-1.5
[41] RSQLite_2.2.9 tibble_3.1.6 pkgconfig_2.0.3 crayon_1.4.2 ellipsis_0.3.2
[46] Matrix_1.2-18 R6_2.5.1 compiler_3.6.3
Dear Michael,
Thanks a lot for looking at this data! Totally makes sense.
Yes, this is a small dataset we use to teach students, hence the small numbers arise. We'll probably make it a bit bigger.
Best Andrey