DESeq2: Shrinkage is inverting fold-changes
1
0
Entering edit mode
Gavin Kelly ▴ 680
@gavin-kelly-6944
Last seen 4.0 years ago
United Kingdom / London / Francis Crick…

The following code produces a case where the MAP estimate of the log fold-change is opposite in sign to the MLE estimate.  I know the fold-change is small in absolute terms, but it's still a little bit tricky to explain the results back to a scientist.  The dispersion, size factors, and more importantly betaPriorVars, were all estimated on a full dataset (which still exhibits the phenomenon), but here I'm just reproducing a few rows.

set.seed(1)
library(DESeq2)

myCounts <- t(matrix(c(
280, 183, 232, 205, 245, 149, 184, 186, 164, 187, 289, 190, 218, 207, 264, 190, 133, 179,
692, 660, 684, 608, 712, 681, 576, 533, 390, 511, 815, 647, 680, 629, 756, 553, 487, 408,
117, 59, 121, 108, 83, 85, 65, 51, 68, 86, 114, 62, 77, 112, 122, 77, 64, 53,
42, 32, 33, 40, 39, 40, 27, 30, 22, 44, 36, 27, 53, 22, 60, 39, 29, 32,
14,  3,  8, 26, 31,  4, 39,  2,  7, 23,  5, 26,  3, 14, 24, 14,  8, 26)
,18,6))

myTreatment <- as.factor(c(6, 2, 5, 6, 4, 3, 1, 2, 5, 4, 1, 2, 3, 4, 5, 6, 1, 3))
myDDS <- DESeqDataSetFromMatrix(myCounts, data.frame(treat=myTreatment), design=~treat)
myDDS <- DESeq(myDDS)
mySF <- c(1.1380559, 1.0914526, 1.1332803, 0.9004886, 1.1694250, 0.9821480, 0.9735406, 0.9404061, 0.7625018, 0.8666049, 1.4300050, 1.0023437, 1.1424392, 0.9873334, 1.2176223, 0.9470914, 0.8080119, 0.840594)

sizeFactors(myDDS) <- mySF

dispersions(myDDS) <- c(0.010264991, 0.009097275, 0.024326553, 0.049801039, 0.014744340, 0.597720247)
myDDS <- nbinomWaldTest(myDDS, betaPriorVar=c(1e6, rep(6.05836e-02, 6)))
res <- results(myDDS, c("treat","3","2"), addMLE=TRUE)
res[sign(res$log2FoldChange) != sign(res$lfcMLE),]

   baseMean log2FoldChange      lfcMLE  ...   
1  198.6187   0.0001523389 -1.5469e-05 

Perhaps unrelated, and less relevant here, but I'm wondering if there are any discussions of careful, considered use of DESeq2 for real-life data.  I'm aware of the 'RNA-seq workflow at the gene level' document, but my investigation into the sign-change quirk (I got as far as the the C-code for fitBeta before realising I was a bit out of my depth) has increased my admiration for the package, but also flagged that I'm unaware of the full consequences of using the shrinkage (the way it averages variances across contrasts means that I may want to slice the data before DESeq'ing it if I have potential contrasts with drastically different numbers of differential-genes; when my MLE fold-changes look very tight but with longer-than-normal tails, should I really be using shrinkage; how far does shrinkage make multiple-testing unnecessary;  ...)

I think I should write up my (probably misguided) thoughts on this, but am just wondering if anyone else has captured  their experience of getting the best out of this wonderful package.

 
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
#It happens in our Linux install as well, but we're not quite up-to-date, so this is run on a system I have control over.
locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] DESeq2_1.12.3              SummarizedExperiment_1.2.3
[3] Biobase_2.32.0             GenomicRanges_1.24.2      
[5] GenomeInfoDb_1.8.3         IRanges_2.6.1             
[7] S4Vectors_0.10.2           BiocGenerics_0.18.0       
 
deseq2 differential gene expression • 1.9k views
ADD COMMENT
3
Entering edit mode
@mikelove
Last seen 17 hours ago
United States

All regularization methods (including LASSO, ridge regression, elastic net, etc.) can change the sign of coefficients in designs more complex than the simple two group comparisons, especially for those coefficients whose likelihood is very flat: this means that the data gives almost no statistical information to inform the size or direction of the fold change.

"(the way it averages variances across contrasts means that I may want to slice the data before DESeq'ing it if I have potential contrasts with drastically different numbers of differential-genes; when my MLE fold-changes look very tight but with longer-than-normal tails, should I really be using shrinkage; how far does shrinkage make multiple-testing unnecessary;  ...)"

This is discussed in one of the FAQ in the back of the DESeq2 vignette: that, if certain groups have high within-group variance, subsetting to two group comparisons can increase power for the comparisons with lower within-group variance. Instead of "averaging variance across contrasts", I'd say that DESeq2 estimates a single dispersion value per gene given the experimental design, which is the same approach as other empirical Bayes methods on Bioconductor.

The shrinkage produces a useful effect for most RNA-seq experiments, helping with ranking and visualization. But it is certainly not a necessary component of the model, and it's easy enough to turn it off with DESeq(..., betaPrior=FALSE). Here you have 6 groups and so lots of possible pairwise comparisons. If you just want to report the simple log2 fold changes equal to (mean of normalized counts group B) / (mean of normalized counts group A), just turn off the prior on beta (LFCs).

ADD COMMENT
1
Entering edit mode

Thanks for that Michael, just wanted to check it wasn't a numerical instability: my mental model of how a zero-centred prior is combined with likelihood is clearly inadequate.  And yes, our 6 groups are possibly unrealistic (it's actually a 2x3 design) but I'm currently trying to advise our bioinformatics group on possibly pipe-lining deseq2 into our default processing, and was hoping that flattening out the structure, so as to enable continued use of the shrinkage, would work out ok.  As suspected, I'll go with the MLE approach for such situations.

And thanks for the package, and the great support you provide.  Really excellent.

 

ADD REPLY

Login before adding your answer.

Traffic: 1104 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6