dispersion in edgeR


Gordon Smyth 50k

@gordon-smyth

WEHI, Melbourne, Australia

Hi Naomi,

Thanks for your interesting questions about the edgeR model.

1) We assume that variability in RNA-seq counts comes from three sources:

a) sampling variability associated with sequencing (for each lane),
b) technical variation in library preparation (lane to lane), and
c) biological variation

Sources (b) and (c) affect the underlying concentration of each transcript in each RNA sample, whereas (a) affects the precision with which this concentration is measured by sequencing technology. The dispersion parameter in the edgeR model measures the squared coefficient of variation (CV) of each transcript's concentration arising from sources (b) and (c). Experiments suggest that variability from source (b) is relatively minor, so the dispersion is essentially the squared CV of biological variation.

If you sequence deeply enough, you can theoretically eliminate variability from source (a) almost entirely. In other words, you can determine the concentration of each transcript in each sample almost perfectly. However, you can't eliminate the variability from sources (b) and (c) in this way. As you sequence more and more deeply, the power to detect differential expression is eventually determined only by biological variation, hence the asymptote that you mention. In the edgeR model, this is reflected by the fact that observed transcript concentrations converge to gamma distributed random variables with CV = sqrt(dispersion).
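To make the asymptote concrete, here is a small sketch in Python (not edgeR code). It assumes the usual negative binomial parameterisation, in which a count with mean mu and dispersion phi has variance mu + phi * mu^2, so that the squared CV of the count is 1/mu + phi. The dispersion value and the expected counts below are hypothetical, chosen only for illustration.

```python
import math


def total_cv(mean_count, dispersion):
    """CV of a negative binomial count: sqrt(1/mean + dispersion).

    The 1/mean term is the sequencing (Poisson) part, source (a);
    the dispersion term is the part from sources (b) and (c).
    """
    return math.sqrt(1.0 / mean_count + dispersion)


# Hypothetical values: dispersion 0.04 corresponds to a CV of 0.2
# in the underlying transcript concentration.
dispersion = 0.04
expected_counts = [10, 100, 1000, 10_000, 1_000_000]

cvs = [total_cv(m, dispersion) for m in expected_counts]

for m, cv in zip(expected_counts, cvs):
    print(f"expected count {m:>9}: total CV = {cv:.4f}")

# The floor that deeper sequencing cannot get below.
print(f"sqrt(dispersion) = {math.sqrt(dispersion):.4f}")
```

As the expected count grows, the 1/mu term vanishes and the total CV settles at sqrt(dispersion), which is the plateau in power described above.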

To further increase the power to detect differential expression you would need to reduce biological variability as well, and you could only do that by increasing the number of biological replicates. This is what the model predicts.

2) When shrinking the dispersion estimates, the amount of shrinkage depends on the precision with which the original value is estimated as well as on the weight of the prior distribution. For a given number of libraries, larger counts give more reliable estimates of the dispersion than small counts. Hence dispersions for rare transcripts tend to be shrunk more than dispersions for very abundant transcripts. This is why the shrinkage is not monotonic.
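A toy sketch in Python may help show how precision-dependent shrinkage can reorder genes. It is not how edgeR computes tagwise dispersions: the precision-weighted average used here is an assumed stand-in, and the function name `shrink`, the gene labels, and all numbers are hypothetical.

```python
def shrink(raw, precision, common, prior_n):
    """Toy shrinkage: precision-weighted average of the raw estimate
    and the common value. An assumed illustration, not edgeR's method."""
    return (precision * raw + prior_n * common) / (precision + prior_n)


common = 0.1  # hypothetical common dispersion

# Hypothetical genes: (raw dispersion estimate, precision of that estimate).
# The rare transcript is estimated imprecisely, the abundant one precisely.
genes = {
    "rare": (0.6, 5.0),
    "abundant": (0.3, 40.0),
}

# Moderated dispersions under two prior weights.
moderated = {
    prior_n: {
        name: shrink(raw, precision, common, prior_n)
        for name, (raw, precision) in genes.items()
    }
    for prior_n in (10, 30)
}

for prior_n, values in moderated.items():
    for name, value in values.items():
        print(f"prior weight {prior_n}, {name}: {value:.4f}")
```

With these numbers the rare transcript has the larger moderated value at prior weight 10 but the smaller one at prior weight 30, so a plot of one set of moderated values against the other need not be monotone.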

Best
Gordon

------------ original message -------------
[BioC] dispersion in edgeR
Naomi Altman naomi at stat.psu.edu
Fri Jun 25 20:03:52 CEST 2010

I have 2 questions about dispersion in edgeR.

1) The model implies that as sequencing depth increases, the power for testing differential expression comes to an asymptote. This seems odd.

2) Usually when using a shrinkage estimator, values of the original estimates shrink monotonely towards the common estimate. So, if one plots the moderated values against one another for 2 values of the shrinkage parameter, the plot should be monotone increasing. The plot was too big to attach, but what I did was:

d10=estimateTagwiseDisp(d,prior.n=10)
d30=estimateTagwiseDisp(d,prior.n=30)
plot(d10$tagwise.dispersion,d30$tagwise.dispersion)

I have not included my particular set of data, as I am pretty sure we see this for any set. This plot seems to imply that 2 genes could have the same moderated dispersion values at prior.n=10 and very different values at prior.n=30. This is not due to my

--Naomi

Naomi S. Altman                814-865-3791 (voice)
Associate Professor
Dept. of Statistics            814-863-7114 (fax)
Penn State University          814-865-1348 (Statistics)
University Park, PA 16802-2111
