Question: Deseq2 results discrepancies because of versions of R
0
3.6 years ago by
genomica0
United Kingdom
genomica0 wrote:

Hello all,

I was asked to repeat RNAseq analysis done a while ago. Previously, analysis was done using Deseq2 v1.2.10 and R v.3.0.2. Although we are still using the same version for Deseq2, R on our system is now upgraded to 3.2.1. Repeating exactly same analysis using the same code but with newer version of R, gives me half the number of genes as significantly expressed.

Does anyone know why this discrepancy could be?

deseq2 • 1.3k views
modified 3.6 years ago by Michael Love23k • written 3.6 years ago by genomica0
Answer: Deseq2 results discrepancies because of versions of R
3
3.6 years ago by
Michael Love23k
United States
Michael Love23k wrote:

Version 1.0 and 1.2 were the first 2 releases of DESeq2, in April and October 2013. We received a lot of feedback from users and so v1.4 incorporated those changes. Since v1.4, the methods for parameter estimation have remained much more stable, which leads to more stable list of DEG. Since v1.4 we focus on usability and extra features, but not on changing methods for parameter estimation. All important changes are documented in the NEWS file accompanying the package (although this might not be useful as it's hard to tell which changes are the ones that contributed to a difference).

~~~

### General advice on why numbers of DEG change with small changes in parameters:

For more detail on how the number of DEG could have changed, it's important to remember that p-values are tail probabilities. This makes them especially sensitive to change in how you estimate parameters.

For an example of how tail probabilities are sensitive, suppose we observe 10,000 random variables from a standard normal distribution. How many of these are in the lower 1% tail of the distribution we drew these from?

> set.seed(1)
> z <- rnorm(10000, 0, 1)
> sum(pnorm(z, 0, 1) < .01)
[1] 115

We expect this to be around 100. Note there is already variance because Z is a random sample.

Now, suppose we make a 5% change in the standard deviation of the distribution we use to evaluate Z:

> sum(pnorm(z, 0, 0.95) < .01)
[1] 167

After a 5% change in a parameter -- here, underestimating the standard deviation -- we observe a 45% increase in the number of random variables in the lower tail.

What if we increase the parameter by 5%:

> sum(pnorm(z, 0, 1.05) < .01)
[1] 83

This gives a 28% decrease in the number in the lower tail.

My point with this example is that small changes in parameters are amplified when we compute the number of observations (for differential expression, read: number of genes), that fall into the tails.

Dear Michael,

Thank you for the clarification with the versions and parameter estimation! I'd like to ask if I still may experience considerable changes in the number of differentially regulated genes (under some adjusted p-value threshold) when I move from version 1.14.0 to version 1.18.1 of DESeq2?

With these versions of DESeq2, I ran the same code on the same data with the same threshold for the adjusted p-value, and got the following numbers of significantly regulated genes:

 Version / regulation Up Down 1.14.0 1689 1397 1.18.1 1929 1493

So, it seems that v. 1.18.1 could be more sensitive than v. 1.14.0? I went through the NEWS file of the v. 1.18.1, but I'm not sure I can fully understand in what way changes between these two versions could affect the number of genes.

The code was as follows:

dds <- DESeqDataSetFromMatrix(countData=round(expr, 0), colData=condition.table, design = ~ Condition)
ddsM <- DESeq(dds)
res  <- as.data.frame(results(ddsM, contrast = c("Condition", target.condition, base.condition)))

where "expr" is a raw count matrix produced by RSEM. It was the same for both runs.

Regarding other software, with DESeq2 v. 1.18.1 I used R v. 3.4.4 on Windows 7, and with DESeq2 v. 1.14.0 I used R v. 3.3.2 on Ubuntu.

Thanks!

1

We made a change in 1.16 where beta prior is no longer used for statistical testing only for visualization and ranking. The number above are not surprisingly to me, it’s just adding ~10% to the set, and it’s really just moving the threshold usually. If you need the *exact* same set of genes for a long term project, you can keep a version of R on your machine or cluster with a specific version of Bioconductor.

Dear Michael,

Thank you so much for the clarification!

Answer: Deseq2 results discrepancies because of versions of R
0
3.6 years ago by
genomica0
United Kingdom
genomica0 wrote:

Please ignore the previous post. It was my bad in noting the versions of Deseq2. Its v1.2.10 for R3.0.2 and v1.10 for R3.2.0. Sorry!