What's the rationale for using the negative binomial distribution to model read counts?
kjo ▴ 70
@kjo-11078

Several tools for DE analysis (including DESeq2 and edgeR) model read counts as following a negative binomial distribution (NBD).

According to the Wikipedia page on the NBD, "...the negative binomial distribution is a discrete probability distribution of the number of successes in a sequence of independent and identically distributed Bernoulli trials before a specified (non-random) number of failures (denoted r) occurs."

I fail to see how read counts fit this description.  What would constitute the "sequence of independent and identically distributed Bernoulli trials" in such a model of read counts?  And what is the "specified ... number of failures"?

Alternatively, is it possible to describe the modeling of read counts in terms of some other (equivalent) characterization of the NBD?

Bottom line: what is the justification for using the NBD to model read counts?

rnaseq
ADD COMMENT
@ryan-c-thompson-5618
Scripps Research, La Jolla, CA

In addition to James MacDonald's answer, here's another relevant passage from the same Wikipedia article: "The negative binomial distribution also arises as a continuous mixture of Poisson distributions (i.e. a compound probability distribution) where the mixing distribution of the Poisson rate is a gamma distribution."

The other important bit of information is that the read count for a given gene in a sample in theory follows a Binomial(n, p) distribution, where n is the total number of reads and p is the probability of a read mapping to that gene. However, the binomial distribution is computationally inconvenient, since it involves computing factorials, and with millions of reads in each sample the Poisson distribution (with lambda = n*p) is an excellent approximation to the binomial while being far more mathematically tractable.

So the Poisson noise quantifies the counting uncertainty, while the gamma distribution quantifies the variation in gene expression between replicates. The mixture of the two yields the negative binomial. See also section 2 of http://www.statsci.org/smyth/pubs/edgeRChapterPreprint.pdf
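
To make the two steps concrete, here is a small R sketch (all numbers are invented for illustration, not taken from any real dataset). It checks that a Binomial with huge n and tiny p is essentially Poisson, and that a Poisson whose rate is Gamma-distributed across replicates matches rnbinom:

    ## Illustration only: simulated numbers, not real RNA-seq data
    set.seed(1)

    ## (1) Binomial -> Poisson: one gene catching a tiny fraction of many reads
    n <- 2e7                     # total reads in the library (assumed)
    p <- 5e-6                    # probability that a read maps to this gene (assumed)
    binom <- rbinom(1e5, size = n, prob = p)
    pois  <- rpois(1e5, lambda = n * p)
    c(mean(binom), mean(pois))   # both ~ n*p = 100
    c(var(binom),  var(pois))    # both ~ n*p, i.e. Poisson-like

    ## (2) Gamma-distributed Poisson rate -> negative binomial
    mu   <- 100                  # mean expression across replicates (assumed)
    disp <- 0.1                  # biological dispersion = squared CV (assumed)
    lambda <- rgamma(1e5, shape = 1/disp, scale = mu * disp)  # mean mu, CV^2 = disp
    mixed  <- rpois(1e5, lambda)
    nb     <- rnbinom(1e5, mu = mu, size = 1/disp)
    c(var(mixed), var(nb), mu + disp * mu^2)  # all ~ mu + disp*mu^2 = 1100

The counting (Poisson) part contributes mu to the variance and the biological (gamma) part contributes disp*mu^2, which is why the NB variance function is mu + disp*mu^2.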

ADD COMMENT

Thanks!  I now understand why some mixture of Poisson distributions would be a better representation of the phenomenon being modeled, but I still don't understand the justification for using the gamma distribution (GD) as the weighting distribution for the mixture.  I realize that, in this case, the GD turns out to be analytically nice to work with (i.e. with the GD as the weighting distribution, the mixture integral simplifies to the relatively simple NBD), but if that's the only justification for it, then it would be a clear case of the "streetlight effect".  Can the choice of the GD be justified on more substantive grounds than analytic convenience?

ADD REPLY

Because the resulting model is useful. Sure, you could fit an elaborate model that includes all theoretical contributions to experimental and biological variability. Or, you could just fit a simple NB-based model that accounts for the major features of the data, i.e., discrete and overdispersed counts. See George Box's quote that "All models are wrong but some are useful". Will the expression levels of a gene be Gamma-distributed in reality? Almost definitely not, but for a continuous, unimodal (in bulk RNA-seq data) distribution of positive expression values, the Gamma distribution is close enough.

In short, don't think of it as a streetlight at night. Instead, it's more like a sunny day on a meadow with a couple of trees. While some of the field is in shadow (corresponding to special cases that aren't handled well by the NB model, e.g., the bimodal distribution of expression values in single-cell RNA-seq data), most of it can still be searched, even when drunk.

P.S. Alternative Poisson mixtures have been studied for other purposes, e.g., Poisson-Beta models for transcriptional bursting, Poisson-log-normal models for network modelling. But they involve more effort to fit and get inferences from, and are mainly used because of some property of the mixing distribution that is relevant to the specific application (e.g., on/off rates as parameters of the Beta distribution). For routine analyses, if something simpler does the job, why not use it?
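
Just to show what those alternatives look like in practice, here is a quick R sketch of sampling from them (parameters are invented; this is not how the bursting or network models are actually fitted):

    set.seed(2)
    n <- 1e5
    ## Poisson-Beta: rate = maximal rate x Beta-distributed "on" fraction
    counts_pb  <- rpois(n, lambda = 200 * rbeta(n, shape1 = 0.5, shape2 = 2))
    ## Poisson-log-normal: rate varies log-normally across cells/replicates
    counts_pln <- rpois(n, lambda = rlnorm(n, meanlog = log(100), sdlog = 0.5))
    ## Both are overdispersed Poisson mixtures, but neither has the NB's closed form
    c(mean(counts_pb),  var(counts_pb))
    c(mean(counts_pln), var(counts_pln))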

ADD REPLY

As far as I know, the analytic/computational convenience is the primary justification. We're talking about modeling the biological variation here, and in theory the distribution of biological variation could be anything. It depends on the biological system in question. Unfortunately, most RNA-seq experiments have very few replicates, so analysis isn't possible without making some kind of parametric assumption about the biological variation. With many more replicates, you could use a non-parametric approach and avoid such assumptions, but with the typical number of replicates, this approach has very low power.

If you don't like the assumption of a gamma distribution, there are packages that offer more flexible distributions, such as tweeDEseq, which adds a third parameter to the NB distribution that, I believe, allows the distribution of biological variation to have fatter or thinner tails than the gamma can accommodate. Again, though, estimating more parameters requires more replicates. There is also limma's voom function, which sidesteps the Poisson-mixture formulation altogether and simply fits a linear model (i.e. using the normal distribution) with observation weights derived from the empirical mean-variance trend. In both voom and in methods like edgeR that choose the gamma distribution to model the biological variation, the key idea is that the correctness of an inference is far more sensitive to the magnitude of the variance than to the shape of the distribution. This is borne out by numerous studies in which the humble t-test is shown to be quite robust even in the face of substantial deviations from normality in the underlying data, and is also supported by the fact that limma-voom and edgeR's QL method give quite similar p-values on a good-quality data set, despite the substantial differences in their parametric assumptions.

Regarding the idea that limiting oneself to the gamma distribution unreasonably narrows one's search space, we can ask in which cases the gamma distribution will fail to capture the true magnitude of variability. I would expect NB GLM methods to overestimate the significance (i.e. underestimate the FDR) when the biological variation has a heavy-tailed distribution whose fat tails the gamma cannot fit, and vice versa for light-tailed distributions. Additionally, there is probably room for error with distributions that are heavily skewed in one direction or the other.
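
To illustrate the few-replicates point with a toy simulation (this is not how edgeR or DESeq2 estimate dispersions; they share information across genes): a naive moment estimate of the NB dispersion from three replicates is all over the place.

    set.seed(3)
    mu <- 100; disp <- 0.1                       # true mean and dispersion (assumed)
    est <- replicate(1e4, {
        y <- rnbinom(3, mu = mu, size = 1/disp)  # three replicates of one gene
        max((var(y) - mean(y)) / mean(y)^2, 0)   # moment estimate of the dispersion
    })
    quantile(est, c(0.1, 0.5, 0.9))              # very wide spread around the true 0.1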

ADD REPLY

@Aaron, @Ryan: I don't have an alternative to the GD that I think would be better.  My question was simply: out of all the possible distributions one could have picked to model the uncertainty in the Poisson parameter, why the GD?  After reading more on the matter, I think the answer does boil down to analytic convenience, the GD being the conjugate prior of the Poisson, and in a Bayesian context that is the natural choice.  IOW, in the context of Bayesian updating, many weighting distributions would be comparably adequate; one may as well choose the one that is analytically most convenient, and that's the GD in this case.
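
For the record, the conjugacy works out as follows (the standard Gamma-Poisson mixture, with the Gamma written in shape/rate form; this is just the textbook calculation):

    P(Y = y) = \int_0^\infty \frac{e^{-\lambda}\lambda^{y}}{y!}
               \cdot \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\lambda^{\alpha-1}e^{-\beta\lambda}\,d\lambda
             = \frac{\Gamma(y+\alpha)}{y!\,\Gamma(\alpha)}
               \left(\frac{\beta}{\beta+1}\right)^{\alpha}
               \left(\frac{1}{\beta+1}\right)^{y}

which is a negative binomial with size alpha and success probability beta/(beta+1); in the mean/dispersion parameterization used by edgeR and DESeq2, mu = alpha/beta and phi = 1/alpha, so that Var(Y) = mu + phi*mu^2.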

ADD REPLY

You haven't quite got it yet. Your assumption that there must be a whole lot of "comparably adequate" candidate distributions is not correct.

To be reasonable for the application, the mixing distribution has to have a constant coefficient of variation and has to be defined on the positive real line. Among all well-known distributions, the only ones that have these properties are the gamma and the log-normal. So the only available choices for the count distribution are a gamma-Poisson mixture or a lognormal-Poisson mixture. These two distributions are almost impossible to distinguish, but one has an analytic closed form and the other does not. So the choice is obvious.
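
For what it's worth, this is easy to check by simulation (R sketch, invented parameters): matching the mean and CV of the mixing distribution, the gamma-Poisson and lognormal-Poisson counts are essentially indistinguishable.

    set.seed(4)
    n <- 1e5; mu <- 100; cv <- 0.3               # invented mean and CV of the rate
    rate_gamma <- rgamma(n, shape = 1/cv^2, scale = mu * cv^2)
    rate_lnorm <- rlnorm(n, meanlog = log(mu) - log(1 + cv^2)/2,
                            sdlog  = sqrt(log(1 + cv^2)))
    y_gamma <- rpois(n, rate_gamma)
    y_lnorm <- rpois(n, rate_lnorm)
    ## Quantiles of the two count distributions are nearly identical
    rbind(gamma = quantile(y_gamma, c(.05, .25, .5, .75, .95)),
          lnorm = quantile(y_lnorm, c(.05, .25, .5, .75, .95)))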

Most of this is explained in the article that Ryan links to. Notice that Section 2.2 of the article is derived entirely without assuming any specific mixing distribution.

Every distribution used in statistics is based partly on analytic convenience, but no more here than elsewhere.

ADD REPLY
@james-w-macdonald-5106
United States

The rationale can be found further down on that Wikipedia page:

The negative binomial distribution, especially in its alternative parameterization described above, can be used as an alternative to the Poisson distribution. It is especially useful for discrete data over an unbounded positive range whose sample variance exceeds the sample mean. In such cases, the observations are overdispersed with respect to a Poisson distribution, for which the mean is equal to the variance. Hence a Poisson distribution is not an appropriate model. Since the negative binomial distribution has one more parameter than the Poisson, the second parameter can be used to adjust the variance independently of the mean. See Cumulants of some discrete probability distributions.

Discrete data over an unbounded positive range is one way of saying 'integer counts'. And it has been shown repeatedly that RNA-Seq counts are over-dispersed, meaning that the sample variance exceeds the sample mean (the variance and mean are the same value for the Poisson distribution).
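
A quick way to see this overdispersion (here on simulated NB counts with four replicates, but the same check works on a real count matrix; all numbers are invented):

    set.seed(5)
    ngenes <- 5000
    mu   <- rgamma(ngenes, shape = 1, scale = 200)   # invented per-gene means
    disp <- 0.1                                      # invented common dispersion
    counts <- matrix(rnbinom(ngenes * 4, mu = mu, size = 1/disp), ncol = 4)
    gene_mean <- rowMeans(counts)
    gene_var  <- apply(counts, 1, var)
    summary(gene_var / pmax(gene_mean, 1))           # ratios mostly well above 1, i.e. overdispersed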

ADD COMMENT
