Question: Correct sequence of functions for edgeR, edgeR robust and edgeR QL.
2.2 years ago by
UCPH
maltethodberg50 wrote:

Due to the increase in functionality in edgeR, I've become slightly confused about the correct order of calling the different functions, since there seem to be multiple ways of doing the same thing. Specifically, there are currently 3-4 ways of estimating dispersion:

1) 3-step method:

• estimateGLMCommonDisp
• estimateGLMTrendedDisp
• estimateGLMTagwiseDisp

2) All-in-one method:

• estimateDisp

3) Robust method:

• estimateGLMRobustDisp

4) QL:

• estimateDisp
• glmQLFit

The estimateDisp help file says it's similar to calling estimateGLMCommonDisp, estimateGLMTrendedDisp and estimateGLMTagwiseDisp in sequence, but also that it gives slightly different results. What's the difference between these two approaches, and is estimateDisp always preferred?

estimateDisp also has an argument called robust, which the edgeR user guide suggests setting to TRUE. Is running estimateDisp(robust=TRUE) similar to running estimateGLMRobustDisp?

To make the confusion total, there's the separate QL approach on top of all this. The manual suggests using estimateDisp(robust=TRUE), even though glmQLFit re-estimates the dispersions. Again, glmQLFit also has an argument called robust. In addition, it has an argument called abundance.trend (which I have not seen mentioned in any manuals); should this ever be set to TRUE?

What's considered a good rule of thumb for running the different implementations? Something like below:

edgeR:

1. estimateDisp(robust=TRUE)
2. glmFit
3. glmLRT

edgeR robust:

1. estimateGLMRobustDisp
2. glmFit
3. glmLRT

edgeR QL:

1. estimateDisp(robust=TRUE)
2. glmQLFit(robust=TRUE)
3. glmQLFTest
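In code, those rules of thumb would look something like the sketch below. This is only an illustration: it assumes a DGEList `y` and a design matrix `design` already exist, and `coef=2` is just a placeholder for the comparison of interest.

```r
library(edgeR)

# edgeR (classic GLM pipeline)
y1 <- estimateDisp(y, design, robust=TRUE)
fit1 <- glmFit(y1, design)
res1 <- glmLRT(fit1, coef=2)

# edgeR robust (observation weights against outlier counts)
y2 <- estimateGLMRobustDisp(y, design)
fit2 <- glmFit(y2, design)
res2 <- glmLRT(fit2, coef=2)

# edgeR QL (quasi-likelihood pipeline)
y3 <- estimateDisp(y, design, robust=TRUE)
fit3 <- glmQLFit(y3, design, robust=TRUE)
res3 <- glmQLFTest(fit3, coef=2)
```

In each case, topTags() on the result would then give the ranked gene list.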

2.2 years ago by
Aaron Lun19k
Cambridge, United Kingdom
Aaron Lun19k wrote:

Your call sequences are correct for all three approaches. A couple of comments, though:

1. As for why estimateDisp is different from the common/trended/tagwise trio of methods, see C: edger, trended or common dispersion.
2. No, estimateGLMRobustDisp is distinct from robust=TRUE in estimateDisp. The former uses observation weights to reduce the impact of outlier counts within each gene. The latter doesn't protect each gene from outlier observations, but instead protects the empirical Bayes shrinkage from genes with outlier dispersions (that are caused by outlier counts). This difference in behaviour has some practical consequences. For example, a gene that is DE but has a couple of outlier counts in one group will (hopefully) still be detected as DE with estimateGLMRobustDisp. This is because those outlier counts should be downweighted, thus avoiding increases to the dispersion estimate. In contrast, with estimateDisp, the dispersion gets inflated by the outliers so there won't be enough power to detect DE - however, any deleterious effect of the inflated dispersion on EB shrinkage of all other genes is prevented with robust=TRUE. I use estimateDisp as it's faster (no need to iteratively compute weights), but there can be some benefit from robustifying against outlier observations - read the NAR paper.
3. glmQLFit estimates QL dispersions, it doesn't re-estimate the NB dispersions; keep in mind that they are two separate sets of values. Technically, you don't need to set robust=TRUE in estimateDisp because that only affects the tagwise NB dispersions that are never used in the QL framework - only the trended NB dispersions are used. That said, robustifying doesn't do any harm so you might as well do it if you plan to use the tagwise NB dispersions for diagnostics later, e.g., with plotBCV. Note that the EB shrinkage is repeated within glmQLFit using the QL dispersions. Here, robust=TRUE in glmQLFit is analogous to that in estimateDisp, as it protects the EB shrinkage from outlier dispersions (QL, not NB). Finally, abundance.trend=TRUE is already the default, and just ensures that EB shrinkage of the QL dispersions is performed towards a mean-dependent trend based on the QL dispersions. This trend is distinct from that generated by estimateDisp, which concerns the NB dispersions.
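To make the two separate sets of dispersions concrete, here is a rough sketch (again assuming a DGEList `y` and design matrix `design`):

```r
library(edgeR)

# NB dispersions: only the trended values feed into the QL pipeline;
# the tagwise values are useful for diagnostics such as plotBCV().
y <- estimateDisp(y, design, robust=TRUE)
plotBCV(y)  # common, trended and tagwise NB dispersions

# QL dispersions: estimated and EB-shrunk inside glmQLFit().
# abundance.trend=TRUE is already the default, shown here for clarity.
fit <- glmQLFit(y, design, robust=TRUE, abundance.trend=TRUE)
plotQLDisp(fit)  # raw, trended and shrunk QL dispersions
```

The robust=TRUE in each call protects the respective EB shrinkage step, as described above.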

So, all in all, there are a lot of options, which can be very confusing, but all of them work reasonably well, so it's not too critical which you pick. I would recommend the QL pipeline because it provides more accurate error control as well as robustness to outliers. (The fact that I'm first author on the associated report is, of course, just a coincidence.)

Thank you very much for the detailed answer! I'm glad I was on the right track. With regards to edgeR robust, how does it compare to using limma+voom with array weights? Both seem to revolve around dampening the effect of outliers, but they have a different focus, with limma+voom weighting samples and edgeR robust weighting features, respectively.


Array weights are focused on dealing with outlier samples, whereas the various robustness functions in edgeR deal with outlier observations in each feature, either by downweighting the observations or the entire feature. It depends on what you want to do; array weights use information across all features to estimate the weight for each sample, so they're pretty reliable, but they won't provide protection when the outlier observations are stochastically distributed across many samples.

To follow up, is it reasonable to use the QL pipeline with the estimateGLMRobustDisp? Thanks.

It's not something I've ever tried, but I suppose it would be okay. estimateGLMRobustDisp reports trended dispersions after robustness weighting, which could then be used by glmQLFit, etc. So there's no theoretical reason against using them together, though I should note that currently the estimateGLMRobustDisp function relies on the old estimateGLM*Disp functions rather than the new and improved estimateDisp.

Would the QL pipeline actually use the computed robustness weights, or would it just use the dispersion trend that was estimated robustly but ignore the weights themselves?

Yes, the weights computed by estimateGLMRobustDisp will also be used in downstream GLM fitting and deviance estimation (which is the prelude to QL dispersion estimation). If one accepts that it was okay to use the weights in the GLM fitting for NB dispersion estimation in estimateGLMRobustDisp, then it's probably also fine to use the weights in the GLM fitting for QL dispersion estimation in glmQLFit, too.
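A sketch of that (untested, as noted above) combination might look like this, again assuming a DGEList `y` and design matrix `design`; the premise is that the observation weights stored by estimateGLMRobustDisp are carried along in the DGEList and reused by the downstream fits:

```r
library(edgeR)

# Robust NB dispersion estimation; stores observation weights in y$weights.
y <- estimateGLMRobustDisp(y, design)

# QL pipeline on top: the stored weights enter the GLM fitting
# that precedes QL dispersion estimation.
fit <- glmQLFit(y, design, robust=TRUE)
res <- glmQLFTest(fit, coef=2)
```

As the answer above notes, this hybrid is not a documented pipeline, so treat it as experimental.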

2.2 years ago by
Gordon Smyth33k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth33k wrote:

Fair question. We do offer a lot of alternative pipelines and it could be confusing.

Aaron has already covered most things, but to understand what robust=TRUE does in estimateDisp(), glmQLFit() and eBayes() see our recently accepted manuscript:

The option has the same meaning in all three functions.

Here is a quick overview of the different robustness concepts:

1. One or more samples are outliers: use arrayWeights() or voomWithQualityWeights() in limma.
2. One or more genes are more variable than the others: use robust=TRUE in the Bayes step in either edgeR or limma.
3. There are outlier observations, but not associated with any particular sample or any particular gene: use lmFit(,method="robust") in limma or estimateGLMRobustDisp() in edgeR.

The third option is what statisticians usually think of as robustness, but the first two options are more efficient because they involve global models. Option 1 assumes that outlier observations tend to occur for particular samples. Option 2 assumes that outlier observations tend to be associated with particular genes or with a group of genes. This is more common than you might think.
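As a rough sketch in code (illustrative only, not a recommended combined pipeline; it assumes a DGEList `y`, a voom output `v` and a design matrix `design`):

```r
library(limma)
library(edgeR)

# 1) Outlier samples: per-sample weights.
aw <- arrayWeights(v, design)        # or voomWithQualityWeights() upstream

# 2) Outlier genes: robust EB shrinkage.
fit <- eBayes(lmFit(v, design), robust=TRUE)   # limma
# y <- estimateDisp(y, design, robust=TRUE)    # edgeR analogue

# 3) Outlier observations: robust per-observation fitting.
fitr <- lmFit(v, design, method="robust")      # limma
# y <- estimateGLMRobustDisp(y, design)        # edgeR analogue
```

Each block corresponds to one of the three numbered concepts above.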

Thank you for the response and link. I was not aware that lmFit had an option for robust regression. So in theory it would be possible to combine all 3 robustness concepts into one by running voomWithQualityWeights > lmFit(method="robust") > eBayes(robust=TRUE)?


Possible but very, very unlikely to be sensible.

(1) and (2) consist of concise global models and work OK together, but (3) involves estimating a robustness weight for every single observation and would be best used by itself.