**0**wrote:

Please excuse if this is not the right place to post this question. I am trying to wrap my head around the exact details of the DESeq2 method in order to reimplement a simplified version of it in Stan for training purposes.

At the moment I am struggling to grasp how means for the negative binomial distribution of each gene count* K_ij* are computed.

## Information given in the paper from from Love, Huber, Anders (2010)

In the paper (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8#Materialsandmethods), the following is detailled:

*K_ij ~ NB(mu_ij; alpha_i)*, so the count K for gene i in sample j is negative binomially distributed with mean mu_ij and dispersion alpha_i.

the mean mu_ij is specified as *mu_ij = s_ij * q_ij*

with *s_ij *=* s_j* = a normalization factor and *q_ij* = a quantity that is proportional to the concentration of the cDNA fragments of gene i in the sample j.

*q_ij* is modeled as *log q_ij = sum (x_jr * beta_ir)*

## My questions

In the paper they say they start by fitting a negative binomial GLM to the gene's count data, to get initial estimates *mu_ij.hat*.

1) how does this model exactly look like written down and how do I get my initial estimates *mu_ij.hat*?

2) Are the raw gene counts the dependent variable, or are the *q_ij*? If so, how exactly do I derive the *q_ij*? If gene counts are used: at what point do I account for different library sizes?

3) Some lines above the paper states that the DESeq2 design matrix does not have full rank, but that a unique solution exists due to the zero-centered prior distribution. At this step I am not using the prior, do I nevertheless use a rank-deficient design matrix for this step?

## My attempt at replicating what is done in R

To gain a better understanding I tried to make the following mock example for just one gene in R.

conditions <- c(rep("a", 5), rep("b", 5)) # some random data values_a <- c(7,5,103,20,80) values_b <- c(12,40,170,92,110) example <- data.frame("counts" = c(values_a, values_b), "condition" = c(rep("a", 5), rep("b", 5))) #compute normalization factors s <- c(values_a, values_b) / exp(mean(log(c(values_a, values_b)))) #fit negative binomial glm - EDIT: This is wrong, see first answer res1 <- MASS::glm.nb(log(counts) ~ condition, data = example) mu1 <- exp(res1$fitted.values) * s #fit model without log res2 <- MASS::glm.nb(counts ~ condition, data = example) mu2 <- res2$fitted.values * s

6) how could my code / approach be corrected? I have a feeling that my attempt is not correct as I am fitting 1 single model across all replicates and conditions and am also ignoring library sizes when fitting the model.

5) When I fit the model with/without log I get different results. Why is the one appropriate and not the other?

6) When I fit the model, R computes an estimate for the dispersion. DESeq2 however computes the gene-wise dispersion estimate *alpha_i.gw* differently. Is this a significant discrepancy? Is there some kind of iteration where the initial estimates for *mu_ij.hat* are updated using the newly estimated *alpha_i.gw*?

So far I have worked through the paper, the vignettes, the function documentation and some of the DESeq2 source code, but still have no clear picture in my mind. Any help or pointer in the right direction is greatly appreciated. Thank you very much for your time and help.

sessionInfo() R version 3.5.2 (2018-12-20) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Manjaro Linux Matrix products: default BLAS: /usr/lib/libblas.so.3.8.0 LAPACK: /usr/lib/liblapack.so.3.8.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] MASS_7.3-51.1 compiler_3.5.2 tools_3.5.2 yaml_2.2.0

side question from a newbie: is there a way to use something like LaTeX or MathML to format this post? My attempts so far failed..

0