I originally asked this question on Biostars and a responder suggested I try here. To avoid any duplication of effort, I will copy or link any answer. If you have a better suggestion, please let me know.
Below is a copy-pasted version of my question:
McCarthy, D.J., Chen, Y., and Smyth, G.K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res 40, 4288–4297.
In Figure 2 of this paper, the authors show that estimating dispersion on a per-gene basis is more compatible with their data:
(I realise I'm probably not supposed to copy the figure here, but I've acknowledged the source)
I think understand broadly what is being demonstrated here (please correct me if I'm mistaken): When we estimate dispersions, that is an implicit model of the ratio of the mean to the standard deviation of each gene. Here, the authors are showing, with QQ plots, that the per-gene model describes the observed ratio better than a common dispersion value. Each dot in the plot corresponds to a gene.
I'd like to generate this figure for my own data, but I don't understand how to compute the two vectors required as inputs to qqplot(). I'm guessing that one might be the log likelihood after fitting the GLM?
Thanks for any light you can shed (code also gratefully appreciated, but no obligation)