Hello,

Can someone help me to understand the dispersion calculation of DESeq2?

I don't understand what DESeq2 does:

1) If we consider two conditions A and B with n samples in each condition, is information shared between all samples from both conditions? That is, are the mean normalized counts calculated from the counts of cond. A (22) and cond. B (25), so µ = 23.5? And is the dispersion calculated according to the µ values of all genes?

But in this case, my understanding is that differentially expressed genes will not necessarily be more dispersed than genes that are highly expressed but not differentially expressed between cond. A and B. Is that true?

An example:

```
         cond. A   cond. B   µ
gene 1   22        25        23.5
gene 2   572       620       596
gene 3   40        700       370
```

In this case, if the "common" mean normalized count for all genes is around 100, gene 3 will be less dispersed than gene 2 but more differentially expressed (and gene 3 will not be an outlier circled in blue on the dispersion plot, but its dispersion will be shrunk by the MAP procedure).

2) Or does DESeq2 calculate the mean normalized counts for cond. A (µA) and cond. B (µB) separately? And does it then calculate a dispersion for each condition and evaluate the difference between the dispersion of cond. A and that of cond. B? I think this proposal is much more similar to a test of differential expression than to a dispersion calculation.

Is either of these two proposals correct? Am I misunderstanding something? What are the mistakes in my explanation?

Any comments and help on understanding this would be greatly appreciated. Thank you, E.

Hello Michael,

Thank you for your answer, and sorry for the trivial questions. I read the paper, but I am a beginner in differential expression analysis and biostatistics, so a simplified explanation is useful.

Do you mean that each sample has its own µ, taken as the mean of the normalized counts of all genes together in a single sample? And then the individual gene dispersion is calculated from the normalized count of gene g in condition A, sample 1, and the associated µ of condition A, sample 1? So each gene would have its own dispersion for each sample of each condition.

So how is the dispersion represented by the black dots on the dispersion plot calculated? Does it calculate the difference between the dispersions of each sample regardless of condition (like the dispersion of samples around the mean of the dispersions)? Or does it calculate the difference between the dispersions calculated in cond. A and those calculated in cond. B?

Sorry, I might be wrong; I am trying to understand this in "simple words".

Thank you,

Best,

Eva

It's best to keep with the notation in the paper, so `mu_ij` is a size factor times `q_ij`, where `q_ij` is going to be shared across samples in the same group, for a given gene i.

The gene-wise estimate of dispersion is calculated by finding the value of alpha that maximizes the likelihood of the data, where we have fixed the `mu_ij` to their best estimates. The posterior dispersion is more difficult and requires some knowledge of Bayesian models, but that at least gives you some idea how the dispersion might be estimated. See also this related post: https://www.biostars.org/p/127756/
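To make the gene-wise step concrete, here is a rough Python sketch of the idea, not DESeq2's actual code. It assumes size factors of 1, uses the group mean of normalized counts as a stand-in for the fitted `mu_ij`, replaces the real optimizer with a plain grid search, and leaves out the Cox–Reid adjustment. The replicate counts are made up around the 40 vs. 700 values of gene 3 from the question, and all function names are hypothetical.

```python
import math

def nb_loglik(counts, mus, alpha):
    """Negative binomial log-likelihood: per-sample means, one shared dispersion."""
    r = 1.0 / alpha  # NB "size" parameter
    total = 0.0
    for k, mu in zip(counts, mus):
        total += (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1.0)
                  + r * math.log(r / (r + mu))
                  + k * math.log(mu / (r + mu)))
    return total

def fitted_means(counts, size_factors, groups):
    """mu_ij = s_j * q_ij, with q_ij shared within a group.

    Assumption: q_ij is approximated by the group mean of normalized counts.
    """
    norm = [k / s for k, s in zip(counts, size_factors)]
    q = {}
    for g in set(groups):
        vals = [n for n, gg in zip(norm, groups) if gg == g]
        q[g] = sum(vals) / len(vals)
    return [s * q[g] for s, g in zip(size_factors, groups)]

def genewise_dispersion(counts, mus, lo=1e-6, hi=10.0, steps=4000):
    """Grid search (log scale) for the alpha that maximizes the likelihood."""
    best_alpha, best_ll = lo, -math.inf
    log_lo, log_hi = math.log(lo), math.log(hi)
    for i in range(steps + 1):
        alpha = math.exp(log_lo + (log_hi - log_lo) * i / steps)
        ll = nb_loglik(counts, mus, alpha)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha

# Hypothetical replicates for one gene: three samples per condition.
counts = [35, 40, 45, 650, 700, 750]
size_factors = [1.0] * 6  # assumption: no library-size differences
groups = ["A", "A", "A", "B", "B", "B"]

# Means fixed per group (what the answer describes) vs. one pooled mean.
mus_group = fitted_means(counts, size_factors, groups)
mus_pooled = [sum(counts) / len(counts)] * len(counts)

alpha_group = genewise_dispersion(counts, mus_group)
alpha_pooled = genewise_dispersion(counts, mus_pooled)
print("dispersion around group means:", alpha_group)
print("dispersion around pooled mean:", alpha_pooled)
```

With these made-up replicates, the dispersion estimated around the group means comes out far smaller than the one estimated around a single pooled mean. That is why a large difference between conditions does not by itself inflate the gene-wise dispersion.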

Thank you Michael.

I understand that `mu-bar_i` is the mean of the normalized counts across all samples of all conditions (normalized by `s_ij`, the normalization factor). Is `mu-hat_ij` what you call the best estimate of `mu_ij`? How is `mu-hat_ij` obtained? So the dispersion is calculated from `mu-hat_ij` and not `mu-bar_i`? Why?

Another question: what do you mean by "elements of the design matrix X" for `x_jr`?

Best,

Eva

`hat-mu_ij` is obtained by maximizing the likelihood.

Your second question is answered in the 2014 paper: "We then maximize the Cox–Reid adjusted likelihood of the dispersion, conditioned on the fitted values `hat-mu_ij`".

The design matrix X is used throughout linear modeling. Maybe check a reference on what a design matrix looks like. `x_jr` is the element in the j-th row and r-th column of this matrix.
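As a small illustration of what such a matrix might look like for a two-condition experiment, here is a Python sketch. It assumes an intercept column plus a "B vs. A" indicator column and uses the paper's relation log2 `q_ij` = sum over r of `x_jr` * `beta_ir`. The function names and the coefficient values (chosen to reproduce 40 and 700 from the question) are hypothetical.

```python
import math

def design_matrix(conditions, reference="A"):
    """One row per sample j, one column per coefficient r.

    Column 0: intercept (always 1).
    Column 1: 1 if the sample is not in the reference condition, else 0.
    """
    return [[1.0, 0.0 if c == reference else 1.0] for c in conditions]

def q_from_betas(X, betas):
    """q_ij = 2 ** (sum_r x_jr * beta_ir) for a single gene i."""
    return [2.0 ** sum(x * b for x, b in zip(row, betas)) for row in X]

conditions = ["A", "A", "A", "B", "B", "B"]
X = design_matrix(conditions)

# Hypothetical coefficients for one gene: log2 expression in A, and the
# log2 fold change of B over A.
betas = [math.log2(40.0), math.log2(700.0 / 40.0)]
q = q_from_betas(X, betas)

for j, (row, q_j) in enumerate(zip(X, q)):
    print("sample", j, "x_j =", row, "q =", round(q_j, 3))
```

Each row of X describes one sample, so all samples in the same condition share the same row and therefore the same `q_ij`.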