Hi everyone,

I have only a basic knowledge of statistics and am using the DESeq package for RNA-seq. I don't understand exactly how DESeq calculates p-values. If we simplify the statistics, for a comparison of two sample means we would use either a z-test or a t-test, right?

How about in DESeq?

In the DESeq paper:

$H_0: q_{iA} = q_{iB}$ versus $H_a: q_{iA} \neq q_{iB}$

So:

$K_{iA} = \sum_{j \in A} K_{ij}$, $K_{iB} = \sum_{j \in B} K_{ij}$

Finally,

$$P_i = \frac{\sum_{a+b = K_{iA}+K_{iB},\; p(a,b) \le p(K_{iA},K_{iB})} p(a,b)}{\sum_{a+b = K_{iA}+K_{iB}} p(a,b)}$$

(Sorry, I couldn't write the notation well, as there are not enough formatting options here.)

1- Could someone please explain how this is calculated?

2- Does DESeq use the Negative Binomial to estimate the **Mean** and **Variance**, and then plug these two into a t-test to calculate the p-value?

Thanks for any useful info. Through your answers, we can discuss it more.

Yours,

Giard

This is just a DESeq question so I'm removing the DESeq2 tag.

I wonder whether DESeq2 changed the p-value computation compared to DESeq?

If it has changed, how does DESeq2 do it then? After all, we will end up using the latest version, which is DESeq2.

By default it will perform a Wald test.
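As a rough illustration of what a Wald test does in general: the estimated coefficient (in DESeq2, a log2 fold change) is divided by its standard error, and the resulting z-statistic is compared to a standard normal distribution. The sketch below uses made-up numbers and a hypothetical helper name, `wald_p_value`; it is not DESeq2's own code and skips everything DESeq2 does to obtain the estimate and its standard error.

```python
import math

def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: coefficient = 0, using a standard normal null."""
    z = estimate / std_error
    # P(|Z| >= |z|) for Z ~ N(0, 1); erfc gives this two-sided tail directly.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical values: a log2 fold change of 1.5 with standard error 0.5.
z, p = wald_p_value(1.5, 0.5)
print(f"z = {z:.2f}, p = {p:.4f}")
```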

I really don't get it. Thousands of researchers are using and asking about expression analysis and DESeq, yet no one can give a proper explanation of basic questions.

At least, thanks to those above who tried to say something.

I don't understand why you sound so upset.

As Steve pointed out, the p value calculation in DESeq2 is entirely different from the one described in the 2010 paper on DESeq-1. Hence, I am hesitant to spend time writing an answer to your question if you are not interested in it anyway, as you have stated yourself, above.

I assume that you have just mistakenly read the wrong paper, i.e., spent time trying to understand our 2010 Genome Biology paper, when you should have been reading our 2015 Genome Biology paper. Please read it and come back if you have questions.

And if you really need to know about DESeq-1, please ask again.

Thanks Simon, I don't have a right to be upset as I am one of those looking for help NOT bossing around :)

As I am new to this, I watched some tutorials and started with DESeq-1, which is why I am so keen to understand it first and then move on to DESeq-2.

To state it again, in addition to the main question, I am trying to see how the p-value is calculated when we have the (estimated) mean and variance from the data. As in the example I gave above, in introductory statistics one can calculate a p-value from a mean and variance. So what was used here in DESeq-1 to calculate that p-value, and how?

Appreciated

Calculating a p-value does not always involve calculating a mean and variance, and a mean and variance estimated from the data are certainly not sufficient to calculate a p-value, especially in the case of RNA-seq read counts, which are not normally distributed. Calculating a p-value involves calculating a test statistic from the data and then comparing it to the distribution of that statistic under the null hypothesis. Specifically, you want to find the "tail probability", i.e. the probability, under the null distribution, of obtaining a test statistic at least as extreme as the one actually observed. This means that any method that provides a test statistic and a null distribution to compare it against can be used to obtain a p-value. The DESeq 1 paper describes one such method.

However, as others have advised, DESeq2 doesn't really build on the methods of DESeq 1, so understanding DESeq 1 is not much of a stepping stone to DESeq2. The DESeq2 paper and manual together provide an excellent and self-contained description of the DESeq2 method, and would be a good starting point. If you have some more time, I would recommend introducing yourself to linear models, then reading about the limma package (which is based on linear models with some genomics-specific tweaks), and then reading the latest papers about DESeq2 and edgeR, which are two similar methods, both based on generalized linear models.
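To make the "tail probability" idea concrete, here is a minimal sketch using an exact binomial test rather than anything RNA-seq specific. The test statistic is the number of heads in 10 coin tosses, the null distribution is Binomial(10, 0.5), and the p-value sums the null probabilities of every outcome that is no more likely than the observed one. The function names and the numbers are assumptions chosen for illustration. Note that no sample mean or variance is computed anywhere.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def exact_two_sided_p(k_obs, n, p_null=0.5):
    """Sum the null probabilities of all outcomes no more likely than the observed one."""
    p_obs = binomial_pmf(k_obs, n, p_null)
    return sum(
        prob
        for prob in (binomial_pmf(k, n, p_null) for k in range(n + 1))
        if prob <= p_obs * (1 + 1e-9)  # small tolerance so float ties are counted
    )

# Hypothetical observation: 9 heads out of 10 tosses of a supposedly fair coin.
p_value = exact_two_sided_p(9, 10)
print(p_value)  # outcomes 0, 1, 9 and 10 heads contribute: 22/1024
```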

Basically, the only reason to use DESeq 1 now that DESeq2 is available is if you already did some analysis with DESeq 1 and therefore have to continue your analysis with it for the sake of consistency (e.g. being able to compare results to a previous experiment).

Thanks. I will definitely follow up on your suggested topics. But I'd like to take the chance to ask a couple more related questions, please:

1- You mentioned that the estimated **Mean** and **Variance** are not used to calculate the p-value. Then why are they estimated in the first place, if they are not plugged into the test statistic for the p-value calculation (testing differential expression)? Namely, if read counts and condition replicates are not used for this, then what is used in the test statistic and p-value?

Note: I think **Mean** and **Variance** were used to model the Negative Binomial Distribution. What the benefit of modelling the NB is, if it is not used in differential expression testing, I don't know.

2- Examples of test statistics used to decide whether to reject the null hypothesis (finding a p-value) are t-statistics, z-statistics, chi-square, etc. Which test statistic is used here in DESeq-1 then?

1. I said that not all p-value calculations involve a mean and variance, and that an estimated mean and variance are never sufficient by themselves to calculate a p-value. I didn't say that estimated mean and variance were never used for calculating p-values. I didn't go into the details of the DESeq 1 calculation, but its calculations do involve the mean and variance of the estimated negative binomial distribution (although the NB is typically described in terms of dispersion instead of variance, where variance = mean + dispersion * mean^2). Regardless, there are quite a few steps in between estimating the distributional parameters and calculating the p-value, so the p-value is not calculated directly from the mean and variance.

2. DESeq 1 doesn't use any of those test statistics. It defines a custom test statistic designed specifically for RNA-seq count data, and the paper fully describes all the properties of this statistic required to perform differential expression testing. This is one reason you probably shouldn't start your learning with DESeq 1. limma is probably the easiest to understand. For RNA-seq, all the cleverness of limma goes into the variance estimation, and the test statistic is just a t-statistic (or F-statistic). edgeR and DESeq2 can both use a likelihood ratio test, whose test statistic is asymptotically chi-square distributed. (They both have other tests as well, but the LRT is probably the most well-known one.)
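For readers who still want a feel for the DESeq 1 test, the two points above can be sketched together. The 2010 paper conditions on the observed total count for a gene and sums the null probabilities p(a, b) of all splits (a, b) of that total that are no more likely than the observed split, divided by the sum over all splits, where p(a, b) is the product of two negative binomial probabilities. The code below is only a sketch under strong assumptions: the NB mean and dispersion for each condition total are simply given as hypothetical numbers, whereas the package estimates them from the data (size factors, pooled mean, fitted dispersions). The function names are made up and none of this is DESeq's actual code.

```python
from math import exp, lgamma, log

def nb_variance(mean, dispersion):
    """NB variance in the mean/dispersion parametrization."""
    return mean + dispersion * mean**2

def nb_pmf(k, mean, dispersion):
    """P(K = k) for a negative binomial with the given mean and dispersion."""
    r = 1.0 / dispersion   # NB "size" parameter
    p = r / (r + mean)     # NB success probability
    return exp(lgamma(k + r) - lgamma(r) - lgamma(k + 1)
               + r * log(p) + k * log(1.0 - p))

def exact_nb_p_value(k_a, k_b, mean_a, mean_b, disp_a, disp_b):
    """Conditional exact test on the total k_a + k_b (assumed-known NB parameters)."""
    k_s = k_a + k_b

    def joint(a, b):
        # Null probability of seeing a in condition A and b in condition B.
        return nb_pmf(a, mean_a, disp_a) * nb_pmf(b, mean_b, disp_b)

    p_obs = joint(k_a, k_b)
    all_probs = [joint(a, k_s - a) for a in range(k_s + 1)]
    # Numerator: splits at most as likely as the observed one (tolerance for float ties).
    extreme = sum(q for q in all_probs if q <= p_obs * (1 + 1e-9))
    return extreme / sum(all_probs)

# Hypothetical null parameters: both condition totals have mean 50, dispersion 0.05.
print(nb_variance(50.0, 0.05))                              # 50 + 0.05 * 50**2
print(exact_nb_p_value(20, 80, 50.0, 50.0, 0.05, 0.05))     # very uneven split
print(exact_nb_p_value(48, 52, 50.0, 50.0, 0.05, 0.05))     # nearly even split
```

The mean and dispersion enter only through the null distribution; the p-value itself comes from summing probabilities, not from a t- or z-formula.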

Thanks for that. Time to start my journey toward limma and DESeq-2.

I think the question about the p value calculation in DESeq-1 is good and justified. In the 2010 paper, and in Supplementary 1, section D, it is very hard to understand how you get the p value of the DE test. It's not really clear how you get all the values in the example. So it would be interesting for me as well to know how it works.

I will echo you here. I also have problems understanding how the p value is calculated, but to illustrate my question specifically, I would like to use an example. Say I want to see whether gene A changes significantly after treatment, using RNA-seq data, and I have 2 control samples and 2 treatment samples. The expression values for A in the 2 control samples are A1 and A2, while in the treatment samples they are A3 and A4. If I wanted to see whether mean(A1, A2) is significantly different from mean(A3, A4), I would use the null hypothesis H0: mean(A1, A2) = mean(A3, A4), against H1: mean(A1, A2) != mean(A3, A4). I don't have the population mean and variance of gene A expression, only the ones from the sample (A1 and A2).

My question is: what is the connection between this comparison for an individual gene and the negative binomial distribution of all the genes' expression in the sample?