The DESeq2 paper, page 15 explains the first of the three steps the algorithm uses for finding the dispersion as a function of the mean as follows:
To get a gene-wise dispersion estimate for a gene i, we start by fitting a negative binomial GLM without an LFC prior for the design matrix X to the gene’s count data. This GLM uses a rough method-of-moments estimate of dispersion, based on the within-group variances and means.
I would like to understand this better and I would appreciate more details on this "rough method-of-moments estimate." In principle I can extract this from the code, but that hasn't proved straightforward. Even a clear pointer to exactly where this is carried out in the source code would be a big help.
Thanks, this is helpful! Is there a place where you provide more details on the mathematics of fitting the dispersion function to the data beyond what's in the paper? Perhaps slides from a course on the mathematical foundations of the algorithm, or something like that?
Do you mean, how dispersionFunction is fit?
In ?estimateDispersions we have:
"• parametric - fit a dispersion-mean relation of the form:
dispersion = asymptDisp + extraPois/mean
via a robust gamma-family GLM. The coefficients asymptDisp and extraPois are given"
Then in the DESeq2 paper we say:
"For the trend function, we use the same parametrization as we used for DEXSeq, namely, ..."
Then if you follow the link to the DEXSeq paper you can read:
"To obtain the coefficients a0 and a1, we regress the dispersion estimates for all counting bins from all genes on their average normalized count values with a gamma-family GLM. To ensure robustness of the fit, we iteratively leave out bins with large residuals until convergence is achieved (Huber 1981)."