Question

Fold Change Not Monotonic With Counts

0

Entering edit mode

jgranek • 0

@jgranek-9614

Last seen 6.9 years ago

I have found some cases where the fold change values don't make sense to me, and I am wondering if I hit some edge case bug. Here is one example, notice that row 3 has more MUT1 counts than rows 1 and 2, but a lower fold change.

   WT_1 MUT_1 log2FoldChange
1     0     3   2.348419e+00
2     0     3   2.348419e+00
3     0    16   2.225513e+00
4     2     3   1.369097e-07

I have only tested this with two sample data (no replicates), and only observed this result for genes that have zero counts in one sample. Below is code which tests a few different data values and reports if the log2FC is monotonic with the MUT1 counts for genes that have WT_1 == 0. Thanks for any help!

Josh

library(DESeq2)
library(magrittr)
library(dplyr)

count.df = data.frame(WT_1=c(0,0,0,2), MUT_1=c(3,3,16,3))
rownames(count.df) = paste0("gene_", seq(nrow(count.df)))
metadata.df = data.frame(row.names=c("WT_1","MUT_1"),collection=c("wt","mut"))
 
dds = DESeqDataSetFromMatrix(countData = count.df,
                             colData = metadata.df,
                             design = ~ collection)

dds$collection <- relevel(dds$collection, ref="wt")
dds <- DESeq(dds, minReplicatesForReplace=Inf, fitType="mean")
res <- results(dds, contrast=c("collection","mut","wt"),
                 independentFiltering=FALSE,
                 cooksCutoff=FALSE)
 
dat <- data.frame(counts(dds), log2FoldChange=res$log2FoldChange,
                  dispOutlier = mcols(dds)$dispOutlier)
dat <- dat[dat$WT == 0,]
dat <- dat[order(dat$MUT),]
plot(dat$log2FoldChange)

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_0.4.3                magrittr_1.5               DESeq2_1.10.1              RcppArmadillo_0.6.400.2.2
 [5] Rcpp_0.12.3                SummarizedExperiment_1.0.2 Biobase_2.30.0             GenomicRanges_1.22.3      
 [9] GenomeInfoDb_1.6.2         IRanges_2.4.6              S4Vectors_0.8.7            BiocGenerics_0.16.1       

loaded via a namespace (and not attached):
 [1] RColorBrewer_1.1-2   futile.logger_1.4.1  plyr_1.8.3           XVector_0.10.0       futile.options_1.0.0 tools_3.2.1         
 [7] zlibbioc_1.16.0      rpart_4.1-9          packrat_0.4.4        RSQLite_1.0.0        annotate_1.48.0      gtable_0.1.2        
[13] lattice_0.20-31      DBI_0.3.1            gridExtra_2.0.0      genefilter_1.52.0    cluster_2.0.1        locfit_1.5-9.1      
[19] grid_3.2.1           nnet_7.3-9           R6_2.1.1             AnnotationDbi_1.32.3 XML_3.98-1.3         survival_2.38-1     
[25] BiocParallel_1.4.3   foreign_0.8-63       latticeExtra_0.6-26  Formula_1.2-1        geneplotter_1.48.0   ggplot2_2.0.0       
[31] lambda.r_1.1.7       Hmisc_3.17-1         scales_0.3.0         splines_3.2.1        assertthat_0.1       xtable_1.8-0        
[37] colorspace_1.2-6     acepack_1.3-3.3      lazyeval_0.1.10      munsell_0.4.2

EDIT: Updated to include dispOutlier in the output.

DESeq2 • 1.3k views

ADD COMMENT • link 8.9 years ago jgranek • 0

0

Entering edit mode

OK, updated example code to be less verbose, and hopefully more clear

ADD REPLY • link 8.9 years ago jgranek • 0

0

Entering edit mode

So this plot shows non-monotonicity?

ADD REPLY • link 8.9 years ago Michael Love 43k

0

Entering edit mode

Oh, I see, there are only 4 rows here. Can you replicate this effect with a DESeq() run on a full-sized dataset (hundreds - thousands of genes)?

One thing that's happening is that, the estimation of the spread of dispersion estimates around the prior center is suffering from a paucity of data (4 points). So the 0 vs 16 row is being labelled a "dispersion outlier"

> mcols(dds)$dispOutlier
[1] FALSE FALSE  TRUE FALSE

But in general, the estimation of priors (both dispersion and fold change priors) absolutely relies on feeding in a decent number of rows.

ADD REPLY • link 8.9 years ago Michael Love 43k

0

Entering edit mode

OK, that sheds some light on it, I first wondered if the rows with seemingly incorrect FC values were having problems because they were outliers. But then noticed that small changes in input resulted in more sensible FC values, which smelled like an edge case bug. For example count.df = data.frame(WT_1=c(0,0,0,4), MUT_1=c(3,3,16,3)) results in

       WT_1 MUT_1 log2FoldChange dispOutlier
gene_1    0     3       2.771620       FALSE
gene_2    0     3       2.771620       FALSE
gene_3    0    16       2.852268        TRUE

and count.df = data.frame(WT_1=c(0,0,0,1), MUT_1=c(3,3,16,3)) results in

       WT_1 MUT_1 log2FoldChange dispOutlier
gene_1    0     3       1.495030       FALSE
gene_2    0     3       1.495030       FALSE
gene_3    0    16       1.958546       FALSE

Both meet the relative order I expect for FC values, but the first is called 0 vs 16 an outlier, and the second isn't.

I do have a full-sized dataset where I originally spotted this problem. I checked and confirmed that I see the same effect there: rows with non-monotonic FC values are tagged as dispOutlier, but some rows that are dispOutlier have monotonic FC values. Is that to be expected?

The full data is not RNA-seq, so I may be abusing DESeq2 with it. All the counts are fairly low, and there are a lot of zeros. I'll check outliers on the full dataset. I am happy to share the full data if that would be helpful.

ADD REPLY • link 8.9 years ago jgranek • 0

0

Entering edit mode

If you keep seeing this effect in the full dataset with no replicates, you can avoid dispersion outlier flagging with this call:

dds <- estimateSizeFactors(dds)
dds <- estimateDispersionsGeneEst(dds)
dds <- estimateDispersionsFit(dds)
dds <- estimateDispersionsMAP(dds, outlierSD = Inf)
dds <- nbinomWaldTest(dds)

ADD REPLY • link 8.9 years ago Michael Love 43k

0

Entering edit mode

Oops sorry, this example is monotonically decreasing, which also seems to be a problem. I would expect that if gene1$WT == gene2$WT and gene1$MUT < gene2$MUT, then FC(gene1) < FC(gene2). But here FC(gene1) > FC(gene2).

I do have less minimal datasets that DO show non-monotonicity, but I have not yet wrapped my head around the pattern, so have not figured out a minimal non-monotonic dataset yet.

Perhaps the title needs some clarification too.

ADD REPLY • link 8.9 years ago jgranek • 0

score 0 · Answer 1 · 2016-01-28

hi,

I tried to replicate this with some simulated data, but I get monotonic fold changes.

Could you simplify the code a bit to see systematically what's going on? So just running DESeq() and then results() using the arguments you have above, then:

dat <- data.frame(counts(dds), log2FoldChange=res$log2FoldChange)
dat <- dat[dat$WT == 0,]
dat <- dat[order(dat$MUT),]
plot(dat$log2FoldChange)