Help interpreting DESeq2 method for gene-wide dispersion estimates
1
0
Entering edit mode
@jeremyteitelbaum-16687
Last seen 4.5 years ago
USA/Connecticut/UConn and JAX-GM

The DESeq2 paper, page 15 explains the first of the three steps the algorithm uses for finding the dispersion as a function of the mean as follows:

To get a gene-wise dispersion estimate for a gene i, we start by fitting a negative binomial GLM without an LFC prior for the design matrix X to the gene’s count data. This GLM uses a rough method-of-moments estimate of dispersion, based on the within-group variances and means.

I would like to understand this better and I would appreciate more details on this "rough method-of-moments estimate."  In principle I can extract this from the code, but that hasn't proved straightforward.  Even a clear pointer to exactly where this is carried out in the source code would be a big help.

deseq2 negative binomial model glm • 833 views
1
Entering edit mode
@mikelove
Last seen 9 hours ago
United States

Best to just check the code, we use the minimum of roughDispEstimate() which uses a linear model to fit estimated counts, and momentsDispEst() which just uses the simple mean and variance across all samples and the simple moments based estimate. These are unexported functions in core.R. And remember this is none of the genewise, fitted, or posterior dispersion estimate used in DESeq2, it’s just a rough, stand-in value before the genewise dispersion is estimated.

0
Entering edit mode

Thanks, this is helpful!  Is there a place where you provide more details on the mathematics of fitting the dispersion function to the data beyond what's in the paper?  Perhaps slides from a course on the mathematical foundations of the algorithm, or something like that?

0
Entering edit mode

Do you mean, how dispersionFunction is fit?

In ?estimateDispersions we have:

"• parametric - fit a dispersion-mean relation of the form:
dispersion = asymptDisp + extraPois/mean
via a robust gamma-family GLM. The coefficients asymptDisp and extraPois are given"

Then in the DESeq2 paper we say:

"For the trend function, we use the same parametrization as we used for DEXSeq, namely, ..."

"To obtain the coefficients a0 and a1, we regress the dispersion estimates  for all counting bins from all genes on their average normalized count values  with a gamma-family GLM. To ensure robustness of the fit, we iteratively leave out bins with large residuals until convergence is achieved (Huber 1981)."