The following code produces a case where the MAP estimate of the log fold-change is opposite in sign to the MLE estimate. I know the fold-change is small in absolute terms, but it's still a little bit tricky to explain the results back to a scientist. The dispersion, size factors, and more importantly betaPriorVars, were all estimated on a full dataset (which still exhibits the phenomenon), but here I'm just reproducing a few rows.
set.seed(1) library(DESeq2) myCounts <- t(matrix(c( 280, 183, 232, 205, 245, 149, 184, 186, 164, 187, 289, 190, 218, 207, 264, 190, 133, 179, 692, 660, 684, 608, 712, 681, 576, 533, 390, 511, 815, 647, 680, 629, 756, 553, 487, 408, 117, 59, 121, 108, 83, 85, 65, 51, 68, 86, 114, 62, 77, 112, 122, 77, 64, 53, 42, 32, 33, 40, 39, 40, 27, 30, 22, 44, 36, 27, 53, 22, 60, 39, 29, 32, 14, 3, 8, 26, 31, 4, 39, 2, 7, 23, 5, 26, 3, 14, 24, 14, 8, 26) ,18,6)) myTreatment <- as.factor(c(6, 2, 5, 6, 4, 3, 1, 2, 5, 4, 1, 2, 3, 4, 5, 6, 1, 3)) myDDS <- DESeqDataSetFromMatrix(myCounts, data.frame(treat=myTreatment), design=~treat) myDDS <- DESeq(myDDS) mySF <- c(1.1380559, 1.0914526, 1.1332803, 0.9004886, 1.1694250, 0.9821480, 0.9735406, 0.9404061, 0.7625018, 0.8666049, 1.4300050, 1.0023437, 1.1424392, 0.9873334, 1.2176223, 0.9470914, 0.8080119, 0.840594) sizeFactors(myDDS) <- mySF dispersions(myDDS) <- c(0.010264991, 0.009097275, 0.024326553, 0.049801039, 0.014744340, 0.597720247) myDDS <- nbinomWaldTest(myDDS, betaPriorVar=c(1e6, rep(6.05836e-02, 6))) res <- results(myDDS, c("treat","3","2"), addMLE=TRUE) res[sign(res$log2FoldChange) != sign(res$lfcMLE),] baseMean log2FoldChange lfcMLE ... 1 198.6187 0.0001523389 -1.5469e-05
Perhaps unrelated, and less relevant here, but I'm wondering if there are any discussions of careful, considered use of DESeq2 for real-life data. I'm aware of the 'RNA-seq workflow at the gene level' document, but my investigation into the sign-change quirk (I got as far as the the C-code for fitBeta before realising I was a bit out of my depth) has increased my admiration for the package, but also flagged that I'm unaware of the full consequences of using the shrinkage (the way it averages variances across contrasts means that I may want to slice the data before DESeq'ing it if I have potential contrasts with drastically different numbers of differential-genes; when my MLE fold-changes look very tight but with longer-than-normal tails, should I really be using shrinkage; how far does shrinkage make multiple-testing unnecessary; ...)
I think I should write up my (probably misguided) thoughts on this, but am just wondering if anyone else has captured their experience of getting the best out of this wonderful package.
> sessionInfo() R version 3.3.1 (2016-06-21) Platform: i386-w64-mingw32/i386 (32-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 #It happens in our Linux install as well, but we're not quite up-to-date, so this is run on a system I have control over. locale: [1] LC_COLLATE=English_United Kingdom.1252 [2] LC_CTYPE=English_United Kingdom.1252 [3] LC_MONETARY=English_United Kingdom.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United Kingdom.1252 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] DESeq2_1.12.3 SummarizedExperiment_1.2.3 [3] Biobase_2.32.0 GenomicRanges_1.24.2 [5] GenomeInfoDb_1.8.3 IRanges_2.6.1 [7] S4Vectors_0.10.2 BiocGenerics_0.18.0
Thanks for that Michael, just wanted to check it wasn't a numerical instability: my mental model of how a zero-centred prior is combined with likelihood is clearly inadequate. And yes, our 6 groups are possibly unrealistic (it's actually a 2x3 design) but I'm currently trying to advise our bioinformatics group on possibly pipe-lining deseq2 into our default processing, and was hoping that flattening out the structure, so as to enable continued use of the shrinkage, would work out ok. As suspected, I'll go with the MLE approach for such situations.
And thanks for the package, and the great support you provide. Really excellent.