Question: Highly differentially expressed genes not found with DESeq2 and default betaPrior
0
gravatar for akaever
3.1 years ago by
akaever30
akaever30 wrote:

Hi,

the default DESeq2::DESeq(betaPrior=TRUE) parameter seems to prevent finding single highly significant genes (e.g. inserted into a random data set):

library(DESeq2)

ex <- makeExampleDESeqDataSet(n = 60000, m = 10)
design(ex) <- ~ condition

spikeIn <- matrix(as.integer(c(5,0,177,69,119, 57842,27801,105138,97698,126860,
                               10,11,0,0,2, 806,319,1500,1397,1150,
                               100,110,90,99,95, 806,319,1500,1397,1150,
                               0,1,0,1,1, 806,319,1500,1397,1150)),
                  byrow = TRUE, nrow = 4)
counts(ex)[1:nrow(spikeIn), ] <- spikeIn

ex <- DESeq(ex, parallel = TRUE)
res <- results(ex)
head(res, 4)
log2 fold change (MAP): condition B vs A 
Wald test p-value: condition B vs A 
DataFrame with 4 rows and 6 columns
        baseMean log2FoldChange     lfcSE      stat       pvalue         padj
       <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
gene1 39776.4660       1.283372 0.3460637  3.708485 2.085033e-04 3.347492e-01
gene2   496.9917       1.993635 0.3694553  5.396146 6.808749e-08 1.348200e-03
gene3   542.0637       2.819978 0.2780959 10.140308 3.659435e-24 1.086907e-19
gene4   495.0726       6.384450 0.3230821 19.761072 6.442145e-87 3.826827e-82

 

When setting betaPrior=FALSE, DESeq results in more reasonable pvalues:

ex <- DESeq(ex, betaPrior=FALSE, parallel = TRUE)
res <- results(ex)
head(res, 4)
log2 fold change (MLE): condition B vs A 
Wald test p-value: condition B vs A 
DataFrame with 4 rows and 6 columns
        baseMean log2FoldChange     lfcSE      stat       pvalue         padj
       <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
gene1 39776.4660       10.13320 1.1232360  9.021438 1.856363e-19 2.695687e-15
gene2   496.9917        7.80994 0.9318235  8.381351 5.232344e-17 5.698546e-13
gene3   542.0637        3.38759 0.3342198 10.135813 3.831741e-24 8.346299e-20
gene4   495.0726       10.75123 0.9090859 11.826420 2.850400e-32 1.241748e-27

 

I had not expected such an extreme behavior of this default parameter, meaning, you could easily lose interesting genes when doing the default analysis.

 

By the way, modelMatrixType = "standard" and parallel = TRUE throws an error (parallel = FALSE works):

ex <- DESeq(ex, modelMatrixType = "standard", parallel = TRUE)
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates: 3 workers
mean-dispersion relationship
found already estimated fitted dispersions, removing these
final dispersion estimates, MLE betas: 3 workers
fitting model and testing: 3 workers
Error: BiocParallel errors
  element index: 1, 2, 3
  first error: betaPriorVar should have length 2 to match: (Intercept), conditionB

 

 

 

 
deseq2 betaprior • 580 views
ADD COMMENTlink modified 3.1 years ago by thomas.siegmund0 • written 3.1 years ago by akaever30
Answer: Highly differentially expressed genes not found with DESeq2 and default betaPrio
0
gravatar for Michael Love
3.1 years ago by
Michael Love25k
United States
Michael Love25k wrote:

"I had not expected such an extreme behavior of this default parameter"

This kind of an artificial dataset doesn't really resemble real RNA-seq data and therefore the default doesn't work well.

Such an example is actually mentioned in the DESeq2 documentation, that for such an artificial dataset one should not use the default value of betaPrior: see Frequently Asked Question 5.10 in the vignette about over-shrinking large differences and Figure 18.

ADD COMMENTlink written 3.1 years ago by Michael Love25k

We had observed the first and last row of the spikeIn matrix in a real data set. It took some time to figure out what went wrong. In order to demonstrate that the issue is with the single gene and that not the whole data set is the problem, I tested it in the context of a random data set.

My experience is, that one encounters "artificial" looking data sets more often than expected. When running the default analysis, one would not even notice such extreme profiles. 

Right now, for a standard analysis, I would deactivate all standard options, at the cost of more false-positives (because otherwise interesting candidates get lost):

dds <- DESeq(dds, minReplicatesForReplace = Inf, betaPrior = FALSE)
res <- results(dds, cooksCutoff = FALSE, independentFiltering = FALSE)

 

 

ADD REPLYlink written 3.1 years ago by akaever30

It's not that those rows of counts that are a problem, but embedding these in a simulated dataset with ~60,000 genes where LFC = 0.

But certainly, the beta prior is best turned off for certain datasets as we say in the FAQ.

ADD REPLYlink written 3.1 years ago by Michael Love25k

How can it be explained that for the last row in the spikeIn matrix still a quite high log2FoldChange and low padj are calculated (betaPrior=TRUE)?

ADD REPLYlink written 3.1 years ago by akaever30

The dispersion estimate will be higher for the first gene than the last one. The p-values are driven both by the LFC and the estimates of within-group variance (controlled by the dispersion parameter).

ADD REPLYlink written 3.1 years ago by Michael Love25k
Answer: Highly differentially expressed genes not found with DESeq2 and default betaPrio
0
gravatar for thomas.siegmund
3.1 years ago by
thomas.siegmund0 wrote:

Hi Michael,

this came from a pretty normal data set, comparing two closely related cell lines in a time series. I wouldn't mind if DESeq2 failed on a synthetic data set. But in this case there was no reason whatsoever to assume that the data might need some special treatment like switching off default options. Also the resulting top table looked pretty normal. There was not hint that it might lack critical information. 

If you were interested to look into this in more detail, we might be able to provide the data set.

Best

Thomas 

Dr. Thomas Siegmund

Principal Scientist

Evotec International GmbH

ADD COMMENTlink written 3.1 years ago by thomas.siegmund0

I see where you are coming from, and have been working on models for alternative shrinkage priors for effect sizes in RNA-seq, but I do think that the documentation and support for this concern is sufficient. We built out functionality to visualize the un-moderated fold changes from plotMA, and to add them to the results table using results(). It just happens that the most common LFC distribution for datasets we see is not a delta function at 0 with a handful of genes with LFC=10.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Michael Love25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 189 users visited in the last hour