Question

Highly differentially expressed genes not found with DESeq2 and default betaPrior

0

Entering edit mode

akaever ▴ 30

@akaever-7380

Last seen 7.2 years ago

Hi,

the default DESeq2::DESeq(betaPrior=TRUE) parameter seems to prevent finding single highly significant genes (e.g. inserted into a random data set):

library(DESeq2)

ex <- makeExampleDESeqDataSet(n = 60000, m = 10)
design(ex) <- ~ condition

spikeIn <- matrix(as.integer(c(5,0,177,69,119, 57842,27801,105138,97698,126860,
                               10,11,0,0,2, 806,319,1500,1397,1150,
                               100,110,90,99,95, 806,319,1500,1397,1150,
                               0,1,0,1,1, 806,319,1500,1397,1150)),
                  byrow = TRUE, nrow = 4)
counts(ex)[1:nrow(spikeIn), ] <- spikeIn

ex <- DESeq(ex, parallel = TRUE)
res <- results(ex)
head(res, 4)

log2 fold change (MAP): condition B vs A 
Wald test p-value: condition B vs A 
DataFrame with 4 rows and 6 columns
        baseMean log2FoldChange     lfcSE      stat       pvalue         padj
       <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
gene1 39776.4660       1.283372 0.3460637  3.708485 2.085033e-04 3.347492e-01
gene2   496.9917       1.993635 0.3694553  5.396146 6.808749e-08 1.348200e-03
gene3   542.0637       2.819978 0.2780959 10.140308 3.659435e-24 1.086907e-19
gene4   495.0726       6.384450 0.3230821 19.761072 6.442145e-87 3.826827e-82

When setting betaPrior=FALSE, DESeq results in more reasonable pvalues:

ex <- DESeq(ex, betaPrior=FALSE, parallel = TRUE)
res <- results(ex)
head(res, 4)

log2 fold change (MLE): condition B vs A 
Wald test p-value: condition B vs A 
DataFrame with 4 rows and 6 columns
        baseMean log2FoldChange     lfcSE      stat       pvalue         padj
       <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
gene1 39776.4660       10.13320 1.1232360  9.021438 1.856363e-19 2.695687e-15
gene2   496.9917        7.80994 0.9318235  8.381351 5.232344e-17 5.698546e-13
gene3   542.0637        3.38759 0.3342198 10.135813 3.831741e-24 8.346299e-20
gene4   495.0726       10.75123 0.9090859 11.826420 2.850400e-32 1.241748e-27

I had not expected such an extreme behavior of this default parameter, meaning, you could easily lose interesting genes when doing the default analysis.

By the way, modelMatrixType = "standard" and parallel = TRUE throws an error (parallel = FALSE works):

ex <- DESeq(ex, modelMatrixType = "standard", parallel = TRUE)

using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates: 3 workers
mean-dispersion relationship
found already estimated fitted dispersions, removing these
final dispersion estimates, MLE betas: 3 workers
fitting model and testing: 3 workers
Error: BiocParallel errors
  element index: 1, 2, 3
  first error: betaPriorVar should have length 2 to match: (Intercept), conditionB

deseq2 betaprior • 1.3k views

ADD COMMENT • link updated 7.6 years ago by thomas.siegmund • 0 • written 7.6 years ago by akaever ▴ 30

score 0 · Answer 1 · 2016-09-01

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 3 hours ago

United States

"I had not expected such an extreme behavior of this default parameter"

This kind of an artificial dataset doesn't really resemble real RNA-seq data and therefore the default doesn't work well.

Such an example is actually mentioned in the DESeq2 documentation, that for such an artificial dataset one should not use the default value of betaPrior: see Frequently Asked Question 5.10 in the vignette about over-shrinking large differences and Figure 18.

ADD COMMENT • link 7.6 years ago Michael Love 41k

0

Entering edit mode

We had observed the first and last row of the spikeIn matrix in a real data set. It took some time to figure out what went wrong. In order to demonstrate that the issue is with the single gene and that not the whole data set is the problem, I tested it in the context of a random data set.

My experience is, that one encounters "artificial" looking data sets more often than expected. When running the default analysis, one would not even notice such extreme profiles.

Right now, for a standard analysis, I would deactivate all standard options, at the cost of more false-positives (because otherwise interesting candidates get lost):

dds <- DESeq(dds, minReplicatesForReplace = Inf, betaPrior = FALSE)
res <- results(dds, cooksCutoff = FALSE, independentFiltering = FALSE)

ADD REPLY • link 7.6 years ago akaever ▴ 30

0

Entering edit mode

It's not that those rows of counts that are a problem, but embedding these in a simulated dataset with ~60,000 genes where LFC = 0.

But certainly, the beta prior is best turned off for certain datasets as we say in the FAQ.

ADD REPLY • link 7.6 years ago Michael Love 41k

0

Entering edit mode

How can it be explained that for the last row in the spikeIn matrix still a quite high log2FoldChange and low padj are calculated (betaPrior=TRUE)?

ADD REPLY • link 7.6 years ago akaever ▴ 30

0

Entering edit mode

The dispersion estimate will be higher for the first gene than the last one. The p-values are driven both by the LFC and the estimates of within-group variance (controlled by the dispersion parameter).

ADD REPLY • link 7.6 years ago Michael Love 41k

score 0 · Answer 2 · 2016-09-01

0

Entering edit mode

thomas.siegmund • 0

@thomassiegmund-11406

Last seen 7.6 years ago

Hi Michael,

this came from a pretty normal data set, comparing two closely related cell lines in a time series. I wouldn't mind if DESeq2 failed on a synthetic data set. But in this case there was no reason whatsoever to assume that the data might need some special treatment like switching off default options. Also the resulting top table looked pretty normal. There was not hint that it might lack critical information.

If you were interested to look into this in more detail, we might be able to provide the data set.

Best

Thomas

Dr. Thomas Siegmund

Principal Scientist

Evotec International GmbH

ADD COMMENT • link 7.6 years ago thomas.siegmund • 0

0

Entering edit mode

I see where you are coming from, and have been working on models for alternative shrinkage priors for effect sizes in RNA-seq, but I do think that the documentation and support for this concern is sufficient. We built out functionality to visualize the un-moderated fold changes from plotMA, and to add them to the results table using results(). It just happens that the most common LFC distribution for datasets we see is not a delta function at 0 with a handful of genes with LFC=10.

ADD REPLY • link 7.6 years ago Michael Love 41k