I have found some cases where the fold change values don't make sense to me, and I am wondering if I hit some edge case bug. Here is one example, notice that row 3 has more MUT1 counts than rows 1 and 2, but a lower fold change.
WT_1 MUT_1 log2FoldChange 1 0 3 2.348419e+00 2 0 3 2.348419e+00 3 0 16 2.225513e+00 4 2 3 1.369097e-07
I have only tested this with two sample data (no replicates), and only observed this result for genes that have zero counts in one sample. Below is code which tests a few different data values and reports if the log2FC is monotonic with the MUT1 counts for genes that have WT_1 == 0. Thanks for any help!
Josh
library(DESeq2) library(magrittr) library(dplyr) count.df = data.frame(WT_1=c(0,0,0,2), MUT_1=c(3,3,16,3)) rownames(count.df) = paste0("gene_", seq(nrow(count.df))) metadata.df = data.frame(row.names=c("WT_1","MUT_1"),collection=c("wt","mut")) dds = DESeqDataSetFromMatrix(countData = count.df, colData = metadata.df, design = ~ collection) dds$collection <- relevel(dds$collection, ref="wt") dds <- DESeq(dds, minReplicatesForReplace=Inf, fitType="mean") res <- results(dds, contrast=c("collection","mut","wt"), independentFiltering=FALSE, cooksCutoff=FALSE) dat <- data.frame(counts(dds), log2FoldChange=res$log2FoldChange, dispOutlier = mcols(dds)$dispOutlier) dat <- dat[dat$WT == 0,] dat <- dat[order(dat$MUT),] plot(dat$log2FoldChange)
> sessionInfo() R version 3.2.1 (2015-06-18) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.9.5 (Mavericks) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] dplyr_0.4.3 magrittr_1.5 DESeq2_1.10.1 RcppArmadillo_0.6.400.2.2 [5] Rcpp_0.12.3 SummarizedExperiment_1.0.2 Biobase_2.30.0 GenomicRanges_1.22.3 [9] GenomeInfoDb_1.6.2 IRanges_2.4.6 S4Vectors_0.8.7 BiocGenerics_0.16.1 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-2 futile.logger_1.4.1 plyr_1.8.3 XVector_0.10.0 futile.options_1.0.0 tools_3.2.1 [7] zlibbioc_1.16.0 rpart_4.1-9 packrat_0.4.4 RSQLite_1.0.0 annotate_1.48.0 gtable_0.1.2 [13] lattice_0.20-31 DBI_0.3.1 gridExtra_2.0.0 genefilter_1.52.0 cluster_2.0.1 locfit_1.5-9.1 [19] grid_3.2.1 nnet_7.3-9 R6_2.1.1 AnnotationDbi_1.32.3 XML_3.98-1.3 survival_2.38-1 [25] BiocParallel_1.4.3 foreign_0.8-63 latticeExtra_0.6-26 Formula_1.2-1 geneplotter_1.48.0 ggplot2_2.0.0 [31] lambda.r_1.1.7 Hmisc_3.17-1 scales_0.3.0 splines_3.2.1 assertthat_0.1 xtable_1.8-0 [37] colorspace_1.2-6 acepack_1.3-3.3 lazyeval_0.1.10 munsell_0.4.2
EDIT: Updated to include dispOutlier in the output.
OK, updated example code to be less verbose, and hopefully more clear
So this plot shows non-monotonicity?
Oh, I see, there are only 4 rows here. Can you replicate this effect with a DESeq() run on a full-sized dataset (hundreds - thousands of genes)?
One thing that's happening is that, the estimation of the spread of dispersion estimates around the prior center is suffering from a paucity of data (4 points). So the 0 vs 16 row is being labelled a "dispersion outlier"
But in general, the estimation of priors (both dispersion and fold change priors) absolutely relies on feeding in a decent number of rows.
OK, that sheds some light on it, I first wondered if the rows with seemingly incorrect FC values were having problems because they were outliers. But then noticed that small changes in input resulted in more sensible FC values, which smelled like an edge case bug. For example
count.df = data.frame(WT_1=c(0,0,0,4), MUT_1=c(3,3,16,3))
results inand
count.df = data.frame(WT_1=c(0,0,0,1), MUT_1=c(3,3,16,3))
results inBoth meet the relative order I expect for FC values, but the first is called 0 vs 16 an outlier, and the second isn't.
I do have a full-sized dataset where I originally spotted this problem. I checked and confirmed that I see the same effect there: rows with non-monotonic FC values are tagged as
dispOutlier
, but some rows that aredispOutlier
have monotonic FC values. Is that to be expected?The full data is not RNA-seq, so I may be abusing DESeq2 with it. All the counts are fairly low, and there are a lot of zeros. I'll check outliers on the full dataset. I am happy to share the full data if that would be helpful.
If you keep seeing this effect in the full dataset with no replicates, you can avoid dispersion outlier flagging with this call:
Oops sorry, this example is monotonically decreasing, which also seems to be a problem. I would expect that if gene1$WT == gene2$WT and gene1$MUT < gene2$MUT, then FC(gene1) < FC(gene2). But here FC(gene1) > FC(gene2).
I do have less minimal datasets that DO show non-monotonicity, but I have not yet wrapped my head around the pattern, so have not figured out a minimal non-monotonic dataset yet.
Perhaps the title needs some clarification too.