Please excuse if this is not the right place to post this question. I am trying to wrap my head around the exact details of the DESeq2 method in order to reimplement a simplified version of it in Stan for training purposes.
At the moment I am struggling to grasp how means for the negative binomial distribution of each gene count K_ij are computed.
Information given in the paper from from Love, Huber, Anders (2010)
In the paper (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8#Materialsandmethods), the following is detailled:
K_ij ~ NB(mu_ij; alpha_i), so the count K for gene i in sample j is negative binomially distributed with mean mu_ij and dispersion alpha_i.
the mean mu_ij is specified as mu_ij = s_ij * q_ij
with s_ij = s_j = a normalization factor and q_ij = a quantity that is proportional to the concentration of the cDNA fragments of gene i in the sample j.
q_ij is modeled as log q_ij = sum (x_jr * beta_ir)
In the paper they say they start by fitting a negative binomial GLM to the gene's count data, to get initial estimates mu_ij.hat.
1) how does this model exactly look like written down and how do I get my initial estimates mu_ij.hat?
2) Are the raw gene counts the dependent variable, or are the q_ij? If so, how exactly do I derive the q_ij? If gene counts are used: at what point do I account for different library sizes?
3) Some lines above the paper states that the DESeq2 design matrix does not have full rank, but that a unique solution exists due to the zero-centered prior distribution. At this step I am not using the prior, do I nevertheless use a rank-deficient design matrix for this step?
My attempt at replicating what is done in R
To gain a better understanding I tried to make the following mock example for just one gene in R.
conditions <- c(rep("a", 5), rep("b", 5)) # some random data values_a <- c(7,5,103,20,80) values_b <- c(12,40,170,92,110) example <- data.frame("counts" = c(values_a, values_b), "condition" = c(rep("a", 5), rep("b", 5))) #compute normalization factors s <- c(values_a, values_b) / exp(mean(log(c(values_a, values_b)))) #fit negative binomial glm - EDIT: This is wrong, see first answer res1 <- MASS::glm.nb(log(counts) ~ condition, data = example) mu1 <- exp(res1$fitted.values) * s #fit model without log res2 <- MASS::glm.nb(counts ~ condition, data = example) mu2 <- res2$fitted.values * s
6) how could my code / approach be corrected? I have a feeling that my attempt is not correct as I am fitting 1 single model across all replicates and conditions and am also ignoring library sizes when fitting the model.
5) When I fit the model with/without log I get different results. Why is the one appropriate and not the other?
6) When I fit the model, R computes an estimate for the dispersion. DESeq2 however computes the gene-wise dispersion estimate alpha_i.gw differently. Is this a significant discrepancy? Is there some kind of iteration where the initial estimates for mu_ij.hat are updated using the newly estimated alpha_i.gw?
So far I have worked through the paper, the vignettes, the function documentation and some of the DESeq2 source code, but still have no clear picture in my mind. Any help or pointer in the right direction is greatly appreciated. Thank you very much for your time and help.
sessionInfo() R version 3.5.2 (2018-12-20) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Manjaro Linux Matrix products: default BLAS: /usr/lib/libblas.so.3.8.0 LAPACK: /usr/lib/liblapack.so.3.8.0 locale:  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8  LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8  LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C  LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C attached base packages:  stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached):  MASS_7.3-51.1 compiler_3.5.2 tools_3.5.2 yaml_2.2.0