Search
Question: Correct sequence of functions for edgeR, edgeR robust and edgeR QL.
1
gravatar for maltethodberg
2.6 years ago by
maltethodberg100
UCPH
maltethodberg100 wrote:

Due to the increase in functionality in edgeR, I've become slightly confused about the correct order of calling the different functions, since there seems to be multiple ways of doing the same thing. Specifically, there are currently 3-4 ways of estimating dispersion:

1) 3-step method:

  • estimateGLMCommonDisp
  • estimateGLMTrendedDisp
  • estimateGLMTagwiseDisp

2) All-in-one method:

  • estimateDisp

3) Robust method

  • estimateGLMRobustDisp

4) QL

  • estimateDisp
  • glmQLFit

 

The estimateDisp function help file says it's similar to calling estimateGLMCommonDispestimateGLMTrendedDispestimateGLMTagwiseDisp, but also that it gives slightly different results. What's the difference between these two approaches, and is estimateDisp always the preferred?

The estimateDisp also has an argument called robust, which is suggested to be set to TRUE in the edgeR user guide. Is running estimateDisp(robust=TRUE) similar to running estimateGLMRobustDisp

To make the confusion total, there's the separate approach of using the QL approach instead. The manual suggests using estimateDisp(robust=TRUE), even glmQLFit re-estimates the dispersions. Again, glmQLFit also has an argument called robust. In addition, it has an argument called abundance.trend (which I have not seen mentioned in any manuals), should this ever be set to TRUE?

What's considered a good rule of thumb for running the different implementations? Something like below:

edgeR:

  1. estimateDisp(robust=TRUE)
  2. glmFit
  3. glmLRT

edgeR robust:

  1. estimateGLMRobustDisp
  2. glmFit
  3. glmLRT

edgeR QL:

  1. estimateDisp(robust=TRUE)
  2. glmQLFit(robust=TRUE)
  3. glmQLFTest

 

Any advise is much appreciated!

ADD COMMENTlink modified 2.6 years ago by Gordon Smyth35k • written 2.6 years ago by maltethodberg100
6
gravatar for Aaron Lun
2.6 years ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

Your call sequences are correct for all three approaches. A couple of comments, though:

  1. As for why estimateDisp is different from the common/trended/tagwise trio of methods, see C: edger, trended or common dispersion.
  2. No, estimateGLMRobustDisp is distinct from robust=TRUE in estimateDisp. The former uses observation weights to reduce the impact of outlier counts within each gene. The latter doesn't protect each gene from outlier observations, but instead protects the empirical Bayes shrinkage from genes with outlier dispersions (that are caused by outlier counts). This difference in behaviour has some practical consequences. For example, a gene that is DE but has a couple of outlier counts in one group will (hopefully) still be detected as DE with estimateGLMRobustDisp. This is because those outlier counts should be downweighted, thus avoiding increases to the dispersion estimate. In contrast, with estimateDisp, the dispersion gets inflated by the outliers so there won't be enough power to detect DE - however, any deleterious effect of the inflated dispersion on EB shrinkage of all other genes is prevented with robust=TRUE. I use estimateDisp as it's faster (no need to iteratively compute weights), but there can be some benefit from robustifying against outlier observations - read the NAR paper.
  3. glmQLFit estimates QL dispersions, it doesn't re-estimate the NB dispersions; keep in mind that they are two separate sets of values. Technically, you don't need to set robust=TRUE in estimateDisp because that only affects the tagwise NB dispersions that are never used in the QL framework - only the trended NB dispersions are used. That said, robustifying doesn't do any harm so you might as well do it if you plan to use the tagwise NB dispersions for diagnostics later, e.g., with plotBCV. Note that the EB shrinkage is repeated within glmQLFit using the QL dispersions. Here, robust=TRUE in glmQLFit is analogous to that in estimateDisp, as it protects the EB shrinkage from outlier dispersions (QL, not NB). Finally, abundance.trend=TRUE is already the default, and just ensures that EB shrinkage of the QL dispersions is performed towards a mean-dependent trend based on the QL dispersions. This trend is distinct from that generated by estimateDisp, which concerns the NB dispersions.

So, all in all, there's a lot of options that can be very confusing, but all of them work reasonably well so it's not too critical what you pick. I would recommend using the QL pipeline because it provides more accurate error control as well as robustness to outliers. (The fact that I'm first author on the associated report is, of course, just a coincidence.)

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Aaron Lun21k

Thank you very much for the detailed answer! I'm glad I was on the right track. With regards to edgeR robust, how does it compare to using limma+voom with array weights? Both seem to involve around dampening the effect of outliers, but seem to have a different focus, with limma+voom weighting samples vs and edgeR robust weighting features, respectively.

ADD REPLYlink written 2.6 years ago by maltethodberg100
1

Array weights are focused on dealing with outlier samples, whereas the various robustness functions in edgeR deal with outlier observations in each feature, either by downweighting the observations or the entire feature. Depends on what you want to do; you use information across all features to estimate the array weight for each sample, so it's pretty reliable, but they won't provide protection when the outlier observations are stochastically distributed across many samples.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Aaron Lun21k

To follow up, is it reasonable to use the QL pipeline with the estimateGLMRobustDisp? Thanks.

ADD REPLYlink written 9 months ago by Stephen Hoang30

It's not something I've ever tried, but I suppose it would be okay. estimateGLMRobustDisp reports trended dispersions after robustness weighting, which could then be used by glmQLFit, etc. So there's no theoretical reason against using them together, though I should note that currently the estimateGLMRobustDisp function relies on the old estimateGLM*Disp functions rather than the new and improved estimateDisp.

ADD REPLYlink modified 9 months ago • written 9 months ago by Aaron Lun21k

Would the QL pipeline actually use the computed robustness weights, or would it just use the dispersion trend that was estimated robustly but ignore the weights themselves?

ADD REPLYlink written 9 months ago by Ryan C. Thompson6.8k

Yes, the weights computed by estimateGLMRobustDisp will also be used in downstream GLM fitting and deviance estimation (which is the prelude to QL dispersion estimation). If one accepts that it was okay to use the weights in the GLM fitting for NB dispersion estimation in estimateGLMRobustDisp, then it's probably also fine to use the weights in the GLM fitting for QL dispersion estimation in glmQLFit, too.

ADD REPLYlink modified 9 months ago • written 9 months ago by Aaron Lun21k
4
gravatar for Gordon Smyth
2.6 years ago by
Gordon Smyth35k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth35k wrote:

Fair question. We do offer a lot of alternative pipelines and it could be confusing.

Aaron has already covered most things, but to understand what robust=TRUE does in estimateDisp(), glmQLFit() and eBayes() see our recently accepted manuscript:

  http://arxiv.org/abs/1602.08678

The option has the same meaning in all three functions.

Here is quick overview of the different robustness concepts:

  1. One or more samples are outliers: use arrayWeights() or voomWithQualityWeights() in limma.
  2. One or more genes are more variable than the others: use robust=TRUE in the Bayes step in either edgeR or limma.
  3. There are outlier observations, but not associated with any particular sample or any particular gene: use lmFit(,method="robust") in limma or estimateGLMRobustDisp() in edgeR.

The third option is what statisticians usually think of as robustness, but the first two options are more efficient because they involve global models. Option 1 assumes that outlier observations tend to occur for particular samples. Option 2 assumes that outlier observations tend to be associated with particular genes or with a group of genes. This is more common than you might think.

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Gordon Smyth35k

Thank you for the response and link. I was not aware that lmFit had an option for robust regression. So in theory it would be possible to combine all 3 robustness concepts into one by running voomWithQualityWeights > lmFit(method="robust") > eBayes(robust=TRUE)?

ADD REPLYlink written 2.6 years ago by maltethodberg100
1

Possible but very, very unlikely to be sensible.

(1) and (2) consist of concise global models and work ok together but (3) involves estimating a robustness weight for every single observation and would be best used by itself.

ADD REPLYlink modified 2.5 years ago • written 2.6 years ago by Gordon Smyth35k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 137 users visited in the last hour