Question

Dispersion calculation by DESeq2


eva-m.petit • 0

@eva-mpetit-19601

Last seen 7.0 years ago

Hello,

Can someone help me understand how DESeq2 calculates dispersion?

I don't understand what DESeq2 does:

1) If we consider two conditions A and B with n samples in each condition, is information shared between all samples from both conditions? That is, is the mean of normalized counts calculated from the counts of cond. A (22) and cond. B (25), so µ = 23.5? And is the dispersion then calculated according to the µ values of all genes?

But in this case, my understanding is that differentially expressed genes will not necessarily be more dispersed than genes that are highly expressed but not differentially expressed between cond. A and B. Is that true?

An example:

            cond. A    cond. B    µ
gene 1      22         25         23.5
gene 2      572        620        646
gene 3      40         700        370

In this case, if the "common" mean of normalized counts over all genes is around 100, gene 3 will be less dispersed than gene 2 but more differentially expressed (and gene 3 will not be an outlier circled in blue on the dispersion plot, but its dispersion will be shrunk by the MAP procedure).

2) Or does DESeq2 calculate the mean of normalized counts separately for cond. A (µA) and cond. B (µB)? And does it then calculate a dispersion for each condition and evaluate the difference between the dispersions of cond. A and cond. B? I think this proposal is much closer to a test of differential expression than to a dispersion calculation.

Is either of the two proposals correct? Am I misunderstanding something? What are the mistakes in my explanation?

Any comments and help in understanding this would be greatly appreciated. Thank you, E.

DESeq2 dispersion • 3.3k views

ADD COMMENT • link 7.0 years ago eva-m.petit • 0


Hello Michael,

Thank you for your answer, and sorry for the trivial questions. I read the paper, but I am a beginner in differential expression analysis and biostatistics, so a simplified explanation is useful.

Do you mean that each sample has its own µ, obtained by taking the mean of the normalized counts of all genes together in a single sample? And then the individual gene dispersion is calculated from the normalized count of gene g in condition A, sample 1 and the associated µ of condition A, sample 1? So each gene has its own dispersion for each sample of each condition.

So how is the dispersion represented by the black dots on the dispersion plot calculated? Does it calculate the differences between the dispersions of each sample regardless of condition (like the dispersion of samples around the mean of the dispersions)? Or does it calculate the difference between the dispersions calculated in cond. A and those calculated in cond. B?

Sorry, I might be wrong; I am trying to understand this in "simple words".

Thank you,

Best,

Eva

ADD REPLY • link 7.0 years ago eva-m.petit • 0


It's best to stick with the notation in the paper, so: mu_ij is a size factor times q_ij, where q_ij is shared across samples in the same group, for a given gene i.

The gene-wise estimate of dispersion is calculated by finding the value of alpha that maximizes the likelihood of the data, where we have fixed the mu_ij to their best estimates. The posterior dispersion is more difficult and requires some knowledge of Bayesian models, but that at least gives you some idea how the dispersion might be estimated. See also this related post:

https://www.biostars.org/p/127756/
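
To make the gene-wise step concrete, here is a minimal Python sketch; it is not DESeq2's actual code. It assumes the negative binomial parameterization with variance = mu + alpha * mu^2, uses the plain likelihood rather than the Cox-Reid adjusted one, and replaces the optimizer with a simple grid search. The function names, the counts, and the fitted means are made up for illustration, and all size factors are taken to be 1.

```python
import math

def nb_log_likelihood(counts, mus, alpha):
    """Negative binomial log-likelihood for one gene.

    Each count has its own mean mu; all share one dispersion alpha,
    with variance = mu + alpha * mu**2.
    """
    r = 1.0 / alpha  # NB "size" parameter
    total = 0.0
    for k, mu in zip(counts, mus):
        total += (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
                  + r * math.log(r / (r + mu))
                  + k * math.log(mu / (r + mu)))
    return total

def genewise_dispersion(counts, mus, lo=1e-4, hi=10.0, steps=2000):
    """Grid search on log(alpha) for the alpha maximizing the likelihood,
    with the means held fixed (a stand-in for a proper optimizer)."""
    best_alpha, best_ll = lo, -math.inf
    for t in range(steps + 1):
        log_alpha = math.log(lo) + (math.log(hi) - math.log(lo)) * t / steps
        alpha = math.exp(log_alpha)
        ll = nb_log_likelihood(counts, mus, alpha)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha

# Hypothetical gene: 3 replicates in cond. A, 3 in cond. B, strongly DE.
counts = [35, 40, 45, 650, 700, 750]

# Means fixed at the group level (what the answer describes).
mus_group = [40.0] * 3 + [700.0] * 3
# Means fixed at one pooled value (proposal 1 in the question).
mus_pooled = [sum(counts) / len(counts)] * 6

alpha_group = genewise_dispersion(counts, mus_group)
alpha_pooled = genewise_dispersion(counts, mus_pooled)
print("dispersion with group-specific means:", alpha_group)
print("dispersion with one pooled mean:     ", alpha_pooled)
```

With group-specific means, the large difference between conditions is absorbed by the means, so the estimated dispersion only reflects variability among replicates. With a single pooled mean, the same difference would be counted as dispersion.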

ADD REPLY • link 7.0 years ago Michael Love 43k


Thank you Michael.

I understand that mu-bar_i is the mean of normalized counts across all samples of all conditions (normalized by s_ij, the normalization factor).

Is mu-hat_ij what you call the best estimate of mu_ij? How is mu-hat_ij obtained? So the dispersion is calculated from mu-hat_ij and not mu-bar_i? Why?

Another question: what do you mean by "elements of the design matrix X" for x_jr?

Best,

Eva

ADD REPLY • link 7.0 years ago eva-m.petit • 0


hat-mu_ij is obtained by maximizing the likelihood.

Your second question is answered in the 2014 paper: "We then maximize the Cox–Reid adjusted likelihood of the dispersion, conditioned on the fitted values hat-mu_ij".

The design matrix X is used throughout linear modeling. Maybe check a reference on what a design matrix looks like. x_jr is the element in the j-th row and r-th column of this matrix.
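
As an illustration, here is a small Python sketch of a design matrix for a two-group experiment, and of how its elements x_jr enter the model through log2(q_ij) = sum_r x_jr * beta_ir. The sample layout and the coefficient values are made up for the example; in practice the coefficients are estimated from the data.

```python
import math

# Hypothetical layout: samples j = 0..5, three in cond. A, three in cond. B.
conditions = ["A", "A", "A", "B", "B", "B"]

# Design matrix X: one row per sample j, one column per coefficient r.
# Column 0 is the intercept; column 1 indicates membership in cond. B.
X = [[1, 1 if c == "B" else 0] for c in conditions]

# Made-up coefficients for one gene i, on the log2 scale:
# beta_i[0] = log2 expression in cond. A,
# beta_i[1] = log2 fold change of B over A.
beta_i = [math.log2(40), math.log2(700 / 40)]

# log2(q_ij) = sum over r of x_jr * beta_ir
log2_q = [sum(x_jr * b_r for x_jr, b_r in zip(row, beta_i)) for row in X]
q = [2 ** v for v in log2_q]

for j, (row, q_j) in enumerate(zip(X, q)):
    print(f"sample {j} ({conditions[j]}): x_j = {row}, q_ij = {q_j:.1f}")
```

Samples in the same condition have identical rows in X, which is why q_ij is shared within a group.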

ADD REPLY • link 7.0 years ago Michael Love 43k

score 0 · Answer 1 · 2019-01-25


Michael Love 43k

@mikelove

Last seen 1 day ago

United States

The answer is closer to (2), in that each group gets a normalized count mean, but actually each sample gets its own mu once you take the normalization factors into account.

Take a look at the Methods section of the 2014 paper for details on how the software works.
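
A rough Python sketch of that idea, for a single gene: the group-level quantity is shared within a condition, and multiplying it by each sample's size factor gives a per-sample mu. The counts and size factors below are invented, and the group mean of normalized counts is used as a simple stand-in for the fitted value that DESeq2 obtains from its GLM.

```python
# Hypothetical raw counts for one gene: 3 samples in cond. A, 3 in cond. B.
conditions = ["A", "A", "A", "B", "B", "B"]
counts = [32, 40, 48, 630, 700, 770]

# Hypothetical size factors s_j (one per sample, e.g. sequencing depth).
size_factors = [0.8, 1.0, 1.2, 0.9, 1.0, 1.1]

# Normalized counts: raw count divided by the sample's size factor.
normalized = [k / s for k, s in zip(counts, size_factors)]

def group_mean(values, labels, group):
    """Mean of the values whose label equals `group`."""
    selected = [v for v, lab in zip(values, labels) if lab == group]
    return sum(selected) / len(selected)

# Group-level quantity q, shared by all samples of a condition
# (assumption: simple mean instead of a GLM fit).
q_A = group_mean(normalized, conditions, "A")
q_B = group_mean(normalized, conditions, "B")

# Per-sample mean: mu_j = s_j * q_(group of sample j)
mu = [s * (q_A if c == "A" else q_B)
      for s, c in zip(size_factors, conditions)]

for j, (c, s, m) in enumerate(zip(conditions, size_factors, mu)):
    print(f"sample {j} ({c}): size factor {s}, mu = {m:.1f}")
```

So there are two group-level values but six sample-level means, and it is these sample-level means that the dispersion estimate is conditioned on.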

ADD COMMENT • link 7.0 years ago Michael Love 43k