Question: Highly differentially expressed genes not found with DESeq2 and default betaPrior
0
2.7 years ago by
akaever30
akaever30 wrote:

Hi,

the default DESeq2::DESeq(betaPrior=TRUE) parameter seems to prevent finding single highly significant genes (e.g. inserted into a random data set):

library(DESeq2)

ex <- makeExampleDESeqDataSet(n = 60000, m = 10)
design(ex) <- ~ condition

spikeIn <- matrix(as.integer(c(5,0,177,69,119, 57842,27801,105138,97698,126860,
10,11,0,0,2, 806,319,1500,1397,1150,
100,110,90,99,95, 806,319,1500,1397,1150,
0,1,0,1,1, 806,319,1500,1397,1150)),
byrow = TRUE, nrow = 4)
counts(ex)[1:nrow(spikeIn), ] <- spikeIn

ex <- DESeq(ex, parallel = TRUE)
res <- results(ex)
head(res, 4)
log2 fold change (MAP): condition B vs A
Wald test p-value: condition B vs A
DataFrame with 4 rows and 6 columns
baseMean log2FoldChange     lfcSE      stat       pvalue         padj
<numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
gene1 39776.4660       1.283372 0.3460637  3.708485 2.085033e-04 3.347492e-01
gene2   496.9917       1.993635 0.3694553  5.396146 6.808749e-08 1.348200e-03
gene3   542.0637       2.819978 0.2780959 10.140308 3.659435e-24 1.086907e-19
gene4   495.0726       6.384450 0.3230821 19.761072 6.442145e-87 3.826827e-82

When setting betaPrior=FALSE, DESeq results in more reasonable pvalues:

ex <- DESeq(ex, betaPrior=FALSE, parallel = TRUE)
res <- results(ex)
head(res, 4)
log2 fold change (MLE): condition B vs A
Wald test p-value: condition B vs A
DataFrame with 4 rows and 6 columns
baseMean log2FoldChange     lfcSE      stat       pvalue         padj
<numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
gene1 39776.4660       10.13320 1.1232360  9.021438 1.856363e-19 2.695687e-15
gene2   496.9917        7.80994 0.9318235  8.381351 5.232344e-17 5.698546e-13
gene3   542.0637        3.38759 0.3342198 10.135813 3.831741e-24 8.346299e-20
gene4   495.0726       10.75123 0.9090859 11.826420 2.850400e-32 1.241748e-27

I had not expected such an extreme behavior of this default parameter, meaning, you could easily lose interesting genes when doing the default analysis.

By the way, modelMatrixType = "standard" and parallel = TRUE throws an error (parallel = FALSE works):

ex <- DESeq(ex, modelMatrixType = "standard", parallel = TRUE)
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates: 3 workers
mean-dispersion relationship
found already estimated fitted dispersions, removing these
final dispersion estimates, MLE betas: 3 workers
fitting model and testing: 3 workers
Error: BiocParallel errors
element index: 1, 2, 3
first error: betaPriorVar should have length 2 to match: (Intercept), conditionB


deseq2 betaprior • 509 views
modified 2.7 years ago by thomas.siegmund0 • written 2.7 years ago by akaever30
0
2.7 years ago by
Michael Love23k
United States
Michael Love23k wrote:

"I had not expected such an extreme behavior of this default parameter"

This kind of an artificial dataset doesn't really resemble real RNA-seq data and therefore the default doesn't work well.

Such an example is actually mentioned in the DESeq2 documentation, that for such an artificial dataset one should not use the default value of betaPrior: see Frequently Asked Question 5.10 in the vignette about over-shrinking large differences and Figure 18.

We had observed the first and last row of the spikeIn matrix in a real data set. It took some time to figure out what went wrong. In order to demonstrate that the issue is with the single gene and that not the whole data set is the problem, I tested it in the context of a random data set.

My experience is, that one encounters "artificial" looking data sets more often than expected. When running the default analysis, one would not even notice such extreme profiles.

Right now, for a standard analysis, I would deactivate all standard options, at the cost of more false-positives (because otherwise interesting candidates get lost):

dds <- DESeq(dds, minReplicatesForReplace = Inf, betaPrior = FALSE)
res <- results(dds, cooksCutoff = FALSE, independentFiltering = FALSE)

It's not that those rows of counts that are a problem, but embedding these in a simulated dataset with ~60,000 genes where LFC = 0.

But certainly, the beta prior is best turned off for certain datasets as we say in the FAQ.

How can it be explained that for the last row in the spikeIn matrix still a quite high log2FoldChange and low padj are calculated (betaPrior=TRUE)?

The dispersion estimate will be higher for the first gene than the last one. The p-values are driven both by the LFC and the estimates of within-group variance (controlled by the dispersion parameter).

0
2.7 years ago by
thomas.siegmund0 wrote:

Hi Michael,

this came from a pretty normal data set, comparing two closely related cell lines in a time series. I wouldn't mind if DESeq2 failed on a synthetic data set. But in this case there was no reason whatsoever to assume that the data might need some special treatment like switching off default options. Also the resulting top table looked pretty normal. There was not hint that it might lack critical information.

If you were interested to look into this in more detail, we might be able to provide the data set.

Best

Thomas

Dr. Thomas Siegmund

Principal Scientist

Evotec International GmbH