Question

EdgeR: Explaning dispersion types to newbies

1

Entering edit mode

Mr.RB ▴ 20

@mrrb-15633

Last seen 5.0 years ago

Little background

I have used edgeR quite some time now and try to teach others to use it as well. Most of them are familiar with hypothesis testing, means, variances etc, but none looked at RNA-seq data before. So the testing part, i.e. comparing the distribution of two treatments is understandable. However the most troublesome part, at least as I experience it, is understanding the different types of dispersion and how these influence hypothesis testing. I think this is much easier to understand when visually shown, but not as the `BCVplot` but more like density plots for several genes when using common dispersion, trended dispersion and tag wise dispersion.

Question

How could I best explain the concepts of dispersion and the difference among them to people not familiar with dispersion estimates (see little background)? I prefer a visual approach but I'm not sure how this can be done.

Note again that I read the papers and other question, but this seems either too complicated or too focused such that the whole picture gets hard to grasp.

edger dispersion estimatedispersions visualization teaching • 8.7k views

ADD COMMENT • link updated 5.8 years ago by Aaron Lun ★ 28k • written 5.8 years ago by Mr.RB ▴ 20

score 7 · Accepted Answer · 2018-06-22

7

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 1 hour ago

The city by the bay

Hm. The concepts seem pretty obvious to me, but then again I suppose they would be.

Common dispersion is the mean dispersion across all genes.
Trended dispersion is the mean dispersion across all genes with similar abundance. In other words, the fitted value of the mean-dispersion trend.

I suppose I should qualify this by saying that we don't literally compute the mean in edgeR, but rather take the dispersion value that maximizes the adjusted profile likelihood. This is unlikely to be helpful for your audience, so you can probably just simplify it by saying that it's the mean.

For the tagwise dispersions, perhaps the simplest way to proceed would be to make a BCV plot from the output of estimateDisp with prior.df=0. This will yield dispersion estimates without any empirical Bayes shrinkage, i.e., "raw" tagwise estimates that do not share information across genes. You can then compare this to the BCV plot with the default prior.df; you should see that the points are squeezed towards the trend (or towards the common value in red, if you set trend="none"). This should demonstrate how empirical Bayes shrinkage works, effectively squeezing values together to reduce the effect of estimation uncertainty when you have low numbers of replicates.

Now, as for how this affects hypothesis testing - there are three main points, in order of obviousness:

Larger dispersions = higher variance between replicates, which reduce power to detect DE.
The performance of the model (and thus of the DE analysis) depends on the accuracy of the dispersion estimate. If there is a strong mean-dispersion trend, the common dispersion is obviously unsuitable. If the gene-specific dispersions vary around the trend, the trended dispersion is unsuitable. The "raw" tagwise estimates are unbiased estimates of the gene-specific dispersions, and should be the most suitable, except...
The performance of the model also depends on the precision with which the dispersions are estimated. Here, the raw estimates are least stable as they use the least amount of information, whereas the trended (and to a greater extent, common) dispersion estimates share information between genes for greater stability. This is why the shrunken tagwise estimates (that you get with default estimateDisp) are so useful, as they provide a compromise between precision and accuracy.

You may already know that we are now recommending the QL framework with glmQLFit and glmQLFTest for routine GLM-based DE analyses. This introduces another set of concepts, namely the distinction between negative binomial dispersions and quasi-likelihood dispersions. Long story short, the NB dispersions aim to model the mean-dispersion relationship across the entire dataset, while the QL dispersions aim to capture the variability and estimation uncertainty of the dispersion for each gene.

ADD COMMENT • link 5.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thankyou! the prior.df=0 to prior.df = x should really visually clarify the compromis!

ADD REPLY • link 5.8 years ago Mr.RB ▴ 20

0

Entering edit mode

Do you have any suggestions to show what a GLMfit actually does, or to explain this simply. I think the testing of coefficients would be easy understandable, but I completely forgot the fitting part... I know about the gof() plot but I think this will more focus on the results than what it actually does

ADD REPLY • link 5.8 years ago Mr.RB ▴ 20

0

Entering edit mode

If you're asking about the glmFit function, it fits a negative binomial GLM to the counts for each gene. Exactly how it does so involves a number of interesting tricks and speed-ups (IRLS with Levenberg-Marquardt dampening), but this should not be relevant to you or your audience. If you're asking about the concept of fitting a GLM; that is a more general statistical question rather than anything specific to edgeR. I would suggest a more appropriate forum like https://stats.stackexchange.com/, or various online resources on GLMs (starting with linear models may be easier).

ADD REPLY • link 5.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Sorry if I wasn't clear I actually ment whether I could produce a plot like the last figure on this page. Let's say I done my tag wise dispersion estimates based on the whole data, then select a single gene and its counts and dispersion. Like so:

dat <- read.table(text ="con.1 con.2 treat.1 treat.2
                  8100        9599        8144        7000", 
                  header = T)

disp <- 0.01776080

Based on this information I should be able to produce I plot like that right? I tried doing so by using mglmOneWay but couldn't figure it out ;(. If I could produce some of these plots it would be much easier to understand what the two coefficients actually are and why it makes sense to test hypothesis on these.

ADD REPLY • link 5.8 years ago Mr.RB ▴ 20

0

Entering edit mode

Just get the estimated coefficients from the DGEGLM object produced by glmFit. These can be interpreted in the same manner as for linear models (keeping in mind that NB GLMs use a log-link, so the coefficients represent log-differences in expression).

ADD REPLY • link 5.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Sorry for bothering you again! But I can't figure what these coefficients mean. Let's say we compare two treatments, this will result in two coefficients: the first one is the baseline for group1 and the second the 2 vs 1 comparison (according to the user guide). Indeed when I plot coefficient 1 against coefficient 2 all genes falling on/around the diagonal are not significant. But I don't get what I'm comparing here. As this would be one model, the the model would look like this I suppose: y = b0 + treatment * b1. (where treatment is just 1 or 0). So If two conditions would have the same means I would think that b1 is just near zero, however when I look at the output you refered to the second coefficient is nearly the same as the first one for non DE genes why is this?

ADD REPLY • link 5.8 years ago Mr.RB ▴ 20

0

Entering edit mode

Plotting coefficient 2 against coefficient 1 across all genes is meaningless. You already understand that coefficient 2 represents the log-fold change between groups. It follows that your DE genes will be those where coefficient 2 is significantly different from zero. The behaviour of the first coefficient is not relevant.

ADD REPLY • link 5.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Aaah that's clear! Got all the steps now. Thankyou for the great and quick support for the questions!

ADD REPLY • link 5.8 years ago Mr.RB ▴ 20

0

Entering edit mode

hmm I just checked it:

edgeR.fit <- edgeR::glmFit(edgeR.dgelist , design)
edgeR.glrt <- edgeR::glmLRT(edgeR.fit, contrast = mc)

head(edgeR.fit$coefficients,2)
#       groupcon grouptreat
#gene1 -7.057066  -7.104031
#gene2 -6.580015  -6.706760

head(edgeR.glrt$table,2)
#       logFC    logCPM         LR       PValue
#gene1 -0.06775646  9.717235  0.1233314 0.7254490213
#gene2 -0.18285459 10.350316  0.9816747 0.3217856899

But the coefficient is much smaller than zero, but the fold change is nearly zero and the gene is not significant

ADD REPLY • link 5.8 years ago Mr.RB ▴ 20

0

Entering edit mode

You're not fitting the model that you said you were fitting. In your previous comments, you described a model with an intercept, where the first coefficient represents the log-expression in the "baseline" group and the second coefficient represents the log-fold change. In your latest comment above, you are fitting a no-intercept model, where the two coefficients represent the log-expression in the respective groups. Mixing the two interpretations doesn't make sense. (Also, the coefficients are natural log while the log-fold change is log2.) I suggest you re-read the relevant parts of the edgeR (and limma) user's guides.

ADD REPLY • link 5.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

That's why i said " the model would look like this I suppose") , but apparently I was wrong here, but then the plot I mentioned (coefficient 1 vs coefficient) does makes sense. Anyway I will re-read it

ADD REPLY • link 5.8 years ago Mr.RB ▴ 20