Negative binomial fitting DESeq2
1
0
Entering edit mode
Lucy ▴ 20
@lucy-17014
Last seen 6 weeks ago
United Kingdom

Hi,

I apologise for the naive question but after reading the DESeq2 paper and vignette, I am still not sure I fully understand how the method works.

My questions are as follows:

1. As one dispersion parameter is calculated per gene, does the calculation ignore the group membership of each sample, and is this also true for the mean parameter?

2. Is a single negative binomial model fit per gene, so this assumes that the distribution of counts for each condition e.g. genotype is the same, or have I misunderstood this?

3. What is the reason that you can't just perform a t-test comparison? Is the reason for generating a negative binomial model so that you can estimate the variance you would expect if you collected more samples?

4. How does the Wald test work to identify DEGs? Does it essentially look at where the samples from a given condition lie within the NB distribution for that gene?

Thanks for any explanations/clarifications. My lack of stats training is really hindering me here!

Best wishes,

Lucy

DESeq2 RNA-seq • 158 views
0
Entering edit mode
@mikelove
Last seen 3 days ago
United States

See these equations here (they are also in the Methods of the 2014 paper):

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#the-deseq2-model

we have

mu_ij = s_j q_ij

and

log2(q_ij) = X_j,. beta_i

The only way the mean parameter mu_ij would be equal across samples j is if s_j was constant (usually it is not), and additionally if X_j,., the j-th row of the design matrix was equal for all samples. This only happens when design=~1 (intercept only model). So in all other cases then mu_ij is not equal for all samples. Suppose that s_j was equal to 1 for all samples (samples have equal sequencing depth), then mu_ij would be equal for samples that have the same covariates values for variables in the design formula (e.g. samples in the same group).

The rest of question (1) is answered below, see: "The dispersion parameter alpha_i defines the relationship ..."

3) a t-test is under-powered for counts, and DESeq2 additionally benefits from examining thousands of genes, to have a better estimate of the dispersion than a method that looks at one gene alone.

4) The Wald test examines the coefficient associated with differences between groups, e.g. an element of the column vector beta_i, and divides it by its standard error. The formula for the SE is given in the Methods section of the 2014 paper. Then inference follows from standard procedures for when we divide a parameter by its standard error, we compare it to a unit normal, and finally we correct p-value for multiple testing.

0
Entering edit mode

Thank you. With regards to question 1, I am still unclear as to how you can have a single NB model or equation per gene if the mean parameter is sample-specific. I am imagining one curve that shows the distribution of counts in a given gene, but if the mean is sample-specific, then should I be imagining one curve per sample per gene?

Would you be able to provide a theoretical example of an equation for a gene (say with the design formula ~condition) and how you would plug in the relevant values to get back to the original counts. I understand how the 1s and 0s in the design matrix switch on and off particular values in the equation.

0
Entering edit mode

We don't have a single NB model per gene, right? Look again at equation 1:

K_ij ~ NB(mu_ij, alpha_i)


The count for gene i and sample j is given as a NB with a gene and sample specific mean mu_ij and dispersion alpha_i.

0
Entering edit mode

I see, so why don't you end up with separate beta coefficients for each sample/gene combination rather than beta coefficients that represent the fold change of a given gene in e.g. group B vs. group A?

0
Entering edit mode

I don't follow. For gene i, we have a single column vector beta_i which gives the intercept and then log fold changes associated with covariates. E.g. if the design is ~batch + condition, and each have two levels, then the beta_i column vector would have three elements, e.g.: Intercept, batch2, conditionB...

You may want to consult with a textbook on how linear models are represented, it's typical to write:

y_j ~ N(beta_0 + beta_1 x_j, sigma^2)


So beta_0 and beta_1 are shared across the samples j. We could rewrite this as:

y_j ~ N(mu_j, sigma)
mu_j = beta_0 + beta_1 x_j