Question

DESeq2: Shrinkage is inverting fold-changes

0

Entering edit mode

Gavin Kelly ▴ 690

@gavin-kelly-6944

Last seen 5.0 years ago

United Kingdom / London / Francis Crick…

The following code produces a case where the MAP estimate of the log fold-change is opposite in sign to the MLE estimate. I know the fold-change is small in absolute terms, but it's still a little bit tricky to explain the results back to a scientist. The dispersion, size factors, and more importantly betaPriorVars, were all estimated on a full dataset (which still exhibits the phenomenon), but here I'm just reproducing a few rows.

set.seed(1)
library(DESeq2)

myCounts <- t(matrix(c(
280, 183, 232, 205, 245, 149, 184, 186, 164, 187, 289, 190, 218, 207, 264, 190, 133, 179,
692, 660, 684, 608, 712, 681, 576, 533, 390, 511, 815, 647, 680, 629, 756, 553, 487, 408,
117, 59, 121, 108, 83, 85, 65, 51, 68, 86, 114, 62, 77, 112, 122, 77, 64, 53,
42, 32, 33, 40, 39, 40, 27, 30, 22, 44, 36, 27, 53, 22, 60, 39, 29, 32,
14,  3,  8, 26, 31,  4, 39,  2,  7, 23,  5, 26,  3, 14, 24, 14,  8, 26)
,18,6))

myTreatment <- as.factor(c(6, 2, 5, 6, 4, 3, 1, 2, 5, 4, 1, 2, 3, 4, 5, 6, 1, 3))
myDDS <- DESeqDataSetFromMatrix(myCounts, data.frame(treat=myTreatment), design=~treat)
myDDS <- DESeq(myDDS)
mySF <- c(1.1380559, 1.0914526, 1.1332803, 0.9004886, 1.1694250, 0.9821480, 0.9735406, 0.9404061, 0.7625018, 0.8666049, 1.4300050, 1.0023437, 1.1424392, 0.9873334, 1.2176223, 0.9470914, 0.8080119, 0.840594)

sizeFactors(myDDS) <- mySF

dispersions(myDDS) <- c(0.010264991, 0.009097275, 0.024326553, 0.049801039, 0.014744340, 0.597720247)
myDDS <- nbinomWaldTest(myDDS, betaPriorVar=c(1e6, rep(6.05836e-02, 6)))
res <- results(myDDS, c("treat","3","2"), addMLE=TRUE)
res[sign(res$log2FoldChange) != sign(res$lfcMLE),]

   baseMean log2FoldChange      lfcMLE  ...   
1  198.6187   0.0001523389 -1.5469e-05

Perhaps unrelated, and less relevant here, but I'm wondering if there are any discussions of careful, considered use of DESeq2 for real-life data. I'm aware of the 'RNA-seq workflow at the gene level' document, but my investigation into the sign-change quirk (I got as far as the the C-code for fitBeta before realising I was a bit out of my depth) has increased my admiration for the package, but also flagged that I'm unaware of the full consequences of using the shrinkage (the way it averages variances across contrasts means that I may want to slice the data before DESeq'ing it if I have potential contrasts with drastically different numbers of differential-genes; when my MLE fold-changes look very tight but with longer-than-normal tails, should I really be using shrinkage; how far does shrinkage make multiple-testing unnecessary; ...)

I think I should write up my (probably misguided) thoughts on this, but am just wondering if anyone else has captured their experience of getting the best out of this wonderful package.

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
#It happens in our Linux install as well, but we're not quite up-to-date, so this is run on a system I have control over.
locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] DESeq2_1.12.3              SummarizedExperiment_1.2.3
[3] Biobase_2.32.0             GenomicRanges_1.24.2      
[5] GenomeInfoDb_1.8.3         IRanges_2.6.1             
[7] S4Vectors_0.10.2           BiocGenerics_0.18.0

deseq2 differential gene expression • 2.3k views

ADD COMMENT • link updated 8.7 years ago by Michael Love 43k • written 8.7 years ago by Gavin Kelly ▴ 690

score 3 · Accepted Answer · 2016-07-25

All regularization methods (including LASSO, ridge regression, elastic net, etc.) can change the sign of coefficients in designs more complex than the simple two group comparisons, especially for those coefficients whose likelihood is very flat: this means that the data gives almost no statistical information to inform the size or direction of the fold change.

"(the way it averages variances across contrasts means that I may want to slice the data before DESeq'ing it if I have potential contrasts with drastically different numbers of differential-genes; when my MLE fold-changes look very tight but with longer-than-normal tails, should I really be using shrinkage; how far does shrinkage make multiple-testing unnecessary; ...)"

This is discussed in one of the FAQ in the back of the DESeq2 vignette: that, if certain groups have high within-group variance, subsetting to two group comparisons can increase power for the comparisons with lower within-group variance. Instead of "averaging variance across contrasts", I'd say that DESeq2 estimates a single dispersion value per gene given the experimental design, which is the same approach as other empirical Bayes methods on Bioconductor.

The shrinkage produces a useful effect for most RNA-seq experiments, helping with ranking and visualization. But it is certainly not a necessary component of the model, and it's easy enough to turn it off with DESeq(..., betaPrior=FALSE). Here you have 6 groups and so lots of possible pairwise comparisons. If you just want to report the simple log2 fold changes equal to (mean of normalized counts group B) / (mean of normalized counts group A), just turn off the prior on beta (LFCs).