Question

Statistical background for computing intial estimates for the means of negative binomially distributed gene counts in DESeq2

0

Entering edit mode

nikosbosse • 0

@nikosbosse-19203

Last seen 5.3 years ago

Please excuse if this is not the right place to post this question. I am trying to wrap my head around the exact details of the DESeq2 method in order to reimplement a simplified version of it in Stan for training purposes.

At the moment I am struggling to grasp how means for the negative binomial distribution of each gene count K_ij are computed.

Information given in the paper from from Love, Huber, Anders (2010)

In the paper (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8#Materialsandmethods), the following is detailled:

K_ij ~ NB(mu_ij; alpha_i), so the count K for gene i in sample j is negative binomially distributed with mean mu_ij and dispersion alpha_i.

the mean mu_ij is specified as mu_ij = s_ij * q_ij

with s_ij = s_j = a normalization factor and q_ij = a quantity that is proportional to the concentration of the cDNA fragments of gene i in the sample j.

q_ij is modeled as log q_ij = sum (x_jr * beta_ir)

My questions

In the paper they say they start by fitting a negative binomial GLM to the gene's count data, to get initial estimates mu_ij.hat.

1) how does this model exactly look like written down and how do I get my initial estimates mu_ij.hat?

2) Are the raw gene counts the dependent variable, or are the q_ij? If so, how exactly do I derive the q_ij? If gene counts are used: at what point do I account for different library sizes?

3) Some lines above the paper states that the DESeq2 design matrix does not have full rank, but that a unique solution exists due to the zero-centered prior distribution. At this step I am not using the prior, do I nevertheless use a rank-deficient design matrix for this step?

My attempt at replicating what is done in R

To gain a better understanding I tried to make the following mock example for just one gene in R.

conditions <- c(rep("a", 5), rep("b", 5))
# some random data
values_a <- c(7,5,103,20,80) 
values_b <- c(12,40,170,92,110)

example <- data.frame("counts" = c(values_a, values_b),
                      "condition" = c(rep("a", 5), rep("b", 5)))

#compute normalization factors
s <- c(values_a, values_b) / exp(mean(log(c(values_a, values_b))))

#fit negative binomial glm - EDIT: This is wrong, see first answer
res1 <- MASS::glm.nb(log(counts) ~ condition, data = example)
mu1 <- exp(res1$fitted.values) * s

#fit model without log 
res2 <- MASS::glm.nb(counts ~ condition, data = example)
mu2 <- res2$fitted.values * s

6) how could my code / approach be corrected? I have a feeling that my attempt is not correct as I am fitting 1 single model across all replicates and conditions and am also ignoring library sizes when fitting the model.

5) When I fit the model with/without log I get different results. Why is the one appropriate and not the other?

6) When I fit the model, R computes an estimate for the dispersion. DESeq2 however computes the gene-wise dispersion estimate alpha_i.gw differently. Is this a significant discrepancy? Is there some kind of iteration where the initial estimates for mu_ij.hat are updated using the newly estimated alpha_i.gw?

So far I have worked through the paper, the vignettes, the function documentation and some of the DESeq2 source code, but still have no clear picture in my mind. Any help or pointer in the right direction is greatly appreciated. Thank you very much for your time and help.

sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS: /usr/lib/libblas.so.3.8.0
LAPACK: /usr/lib/liblapack.so.3.8.0

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8       
[4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
[7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] MASS_7.3-51.1  compiler_3.5.2 tools_3.5.2    yaml_2.2.0

deseq2 design matrix • 1.2k views

ADD COMMENT • link 5.3 years ago nikosbosse • 0

0

Entering edit mode

side question from a newbie: is there a way to use something like LaTeX or MathML to format this post? My attempts so far failed..

ADD REPLY • link 5.3 years ago nikosbosse • 0

score 2 · Accepted Answer · 2019-01-03

2

Entering edit mode

Michael Love 41k

@mikelove

Last seen 11 hours ago

United States

hi,

I'll give a shot at answering all these questions, but I have a point from the start: many of the algorithmic choices of DESeq2 are designed for it to be fast and computationally robust. Because it is run by many groups, it wouldn't be acceptable for the software to not converge half of the time. However, if you are building a Bayesian hierarchical model (BHM), you can experiment and tune the model to a given dataset. So there are some different choices that one could make when writing up a BHM as compared to the DESeq2 implementation. The basics of the DESeq2-apeglm model are:

K_ij ~ NB(mu_ij, alpha_i)
mu_ij = s_ij q_ij
log q_ij = X_j* beta_i
beta_ik ~ Cauchy(0, S_k)
log alpha_i ~ Normal( log alpha_tr(bar-mu_i), sigma^2)

You could then put priors on the parameters S_k and sigma^2. These are the critical parameters that decide how much spread there should be for the beta's and the alpha's. You could also replace log alpha_tr(bar-mu_i) with something like log alpha_0, and then put a prior on that as well. I don't have concrete recommendations for what those priors would be. I've never tried to write up DESeq2 as a BHM, although I do like to work with Stan, but usually for models that are smaller and easier to fit. I won't be able to help code this or help with fitting.

I use the apeglm formulation above, because we've now shown the Cauchy prior to outperform the Normal prior. A group recently also swapped out the Normal in the last equation (the log dispersion) and replaced it with a t distribution (Eling et al 2018). This makes sense and probably is a better choice. We used a Normal for convenience in DESeq2 (again, speed and computationally robust), and then we deal with outlier dispersion values with a heuristic.

(1) If you were doing a hierarchical model, then you don't need initial mu_ij.

(2) Y_ij are the only observed values. Those are the counts (unscaled). In tximport-DESeq2, the s_ij are estimated by upstream software (e.g. Salmon, tximport, and DESeq2 all contribute to s_ij), and then s_ij is treated as fixed.

(3) We stopped using that approach (expanded model matrices). They were too confusing. We now use full rank design matrices.

(4) I'm not able to help with coding this right now.

(5) You should not take the log of counts. The counts are distributed NB, not the log of counts.

(6) If you are coding up a BHM, then the model will estimate the dispersion for each gene. It is a parameter of the model.

ADD COMMENT • link 5.3 years ago Michael Love 41k

0

Entering edit mode

Thank you very much for taking the time to answer and your kind and thorough reply. I very much appreciate it.

I'm not quite sure yet what approach would suit me best as I am just beginning to work with Bayesian models in general and Differential Gene Expression in particular. So my idea was, just as you said, to follow the general DESeq2-approach, but to replace estimating S_k and sigma^2 by providing priors for them.

I like the idea of the dispersion following a trend, e.g. setting a prior of log alpha_i ~ Normal( log alpha_tr(bar-mu_i), sigma^2). But this is exactly where I get stuck:

In order to fit a trend for the alphas, I need gene-wise estimates of the dispersion, alpha_i.gw. If I understood correctly, this is obtained for gene i as the alpha that maximizes the Cox-Reid-adjusted likelihood, i.e. that maximizes [log f_NB(K_i1; hat-mu_i1, alpha) + log f_NB(K_i2; hat-mu_i2, alpha) + .... - adjustment term].

I have not yet understood how exactly I could obtain those initial estimates mu_ij that I need to plug into the above formula to get a gene-wise estimate of dispersion (Or how to circumvent this by using a hierarchical model).

ADD REPLY • link 5.3 years ago nikosbosse • 0

0

Entering edit mode

You could run DESeq2 and use dispersionFunction(dds) to define the trend. I don’t know how to encode a trended prior in a BHM.

ADD REPLY • link 5.3 years ago Michael Love 41k

0

Entering edit mode

ah yes that could be a reasonable approach for a practical implementation.

But for the purpose of really understanding how DESeq2 works internally, could you maybe please explain to me how DESeq2 manages to obtain those intial estimates hat-mu_ij?

I tried to extract this information from the source code of estimateDispersionsGeneEst, but unfortunately wasn't able to see clearly through all the steps and internal functions that are used.

Thank you very much!

ADD REPLY • link 5.3 years ago nikosbosse • 0

1

Entering edit mode

Those hat-mu_ij for most designs are just the means of normalized counts for the group, then adjusted back up or down by the size factor. You can get them with a linear model.

ADD REPLY • link 5.3 years ago Michael Love 41k

0

Entering edit mode

I understand. That helped a lot. So I can take the mean of the normalized counts for estimating mu_ij and later for estimating log q_ij = X_j* beta_i, q_ij are just the normalized counts, right?

Thanks!

ADD REPLY • link 5.3 years ago nikosbosse • 0

1

Entering edit mode

The following applies for e.g. multi-group designs: hat-mu_ij is not the mean of normalized counts. It is the mean of normalized counts adjusted back up or down by the size factor. So if the mean of normalized counts for the group is 5 and the size factor is 2 then mu_ij becomes 10.

The way that DESeq2 (and any GLM including a BHM) would work is that you never calculate log q_ij. You use a link function and the s_ij are offsets.