Search
Question: Correct sequence of functions for edgeR, edgeR robust and edgeR QL.
0
gravatar for maltethodberg
20 months ago by
UCPH
maltethodberg40 wrote:

Due to the increase in functionality in edgeR, I've become slightly confused about the correct order of calling the different functions, since there seems to be multiple ways of doing the same thing. Specifically, there are currently 3-4 ways of estimating dispersion:

1) 3-step method:

  • estimateGLMCommonDisp
  • estimateGLMTrendedDisp
  • estimateGLMTagwiseDisp

2) All-in-one method:

  • estimateDisp

3) Robust method

  • estimateGLMRobustDisp

4) QL

  • estimateDisp
  • glmQLFit

 

The estimateDisp function help file says it's similar to calling estimateGLMCommonDispestimateGLMTrendedDispestimateGLMTagwiseDisp, but also that it gives slightly different results. What's the difference between these two approaches, and is estimateDisp always the preferred?

The estimateDisp also has an argument called robust, which is suggested to be set to TRUE in the edgeR user guide. Is running estimateDisp(robust=TRUE) similar to running estimateGLMRobustDisp

To make the confusion total, there's the separate approach of using the QL approach instead. The manual suggests using estimateDisp(robust=TRUE), even glmQLFit re-estimates the dispersions. Again, glmQLFit also has an argument called robust. In addition, it has an argument called abundance.trend (which I have not seen mentioned in any manuals), should this ever be set to TRUE?

What's considered a good rule of thumb for running the different implementations? Something like below:

edgeR:

  1. estimateDisp(robust=TRUE)
  2. glmFit
  3. glmLRT

edgeR robust:

  1. estimateGLMRobustDisp
  2. glmFit
  3. glmLRT

edgeR QL:

  1. estimateDisp(robust=TRUE)
  2. glmQLFit(robust=TRUE)
  3. glmQLFTest

 

Any advise is much appreciated!

ADD COMMENTlink modified 20 months ago by Gordon Smyth32k • written 20 months ago by maltethodberg40
5
gravatar for Aaron Lun
20 months ago by
Aaron Lun17k
Cambridge, United Kingdom
Aaron Lun17k wrote:

Your call sequences are correct for all three approaches. A couple of comments, though:

  1. As for why estimateDisp is different from the common/trended/tagwise trio of methods, see C: edger, trended or common dispersion.
  2. No, estimateGLMRobustDisp is distinct from robust=TRUE in estimateDisp. The former uses observation weights to reduce the impact of outlier counts within each gene. The latter doesn't protect each gene from outlier observations, but instead protects the empirical Bayes shrinkage from genes with outlier dispersions (that are caused by outlier counts). This difference in behaviour has some practical consequences. For example, a gene that is DE but has a couple of outlier counts in one group will (hopefully) still be detected as DE with estimateGLMRobustDisp. This is because those outlier counts should be downweighted, thus avoiding increases to the dispersion estimate. In contrast, with estimateDisp, the dispersion gets inflated by the outliers so there won't be enough power to detect DE - however, any deleterious effect of the inflated dispersion on EB shrinkage of all other genes is prevented with robust=TRUE. I use estimateDisp as it's faster (no need to iteratively compute weights), but there can be some benefit from robustifying against outlier observations - read the NAR paper.
  3. glmQLFit estimates QL dispersions, it doesn't re-estimate the NB dispersions; keep in mind that they are two separate sets of values. Technically, you don't need to set robust=TRUE in estimateDisp because that only affects the tagwise NB dispersions that are never used in the QL framework - only the trended NB dispersions are used. That said, robustifying doesn't do any harm so you might as well do it if you plan to use the tagwise NB dispersions for diagnostics later, e.g., with plotBCV. Note that the EB shrinkage is repeated within glmQLFit using the QL dispersions. Here, robust=TRUE in glmQLFit is analogous to that in estimateDisp, as it protects the EB shrinkage from outlier dispersions (QL, not NB). Finally, abundance.trend=TRUE is already the default, and just ensures that EB shrinkage of the QL dispersions is performed towards a mean-dependent trend based on the QL dispersions. This trend is distinct from that generated by estimateDisp, which concerns the NB dispersions.

So, all in all, there's a lot of options that can be very confusing, but all of them work reasonably well so it's not too critical what you pick. I would recommend using the QL pipeline because it provides more accurate error control as well as robustness to outliers. (The fact that I'm first author on the associated report is, of course, just a coincidence.)

ADD COMMENTlink modified 20 months ago • written 20 months ago by Aaron Lun17k

Thank you very much for the detailed answer! I'm glad I was on the right track. With regards to edgeR robust, how does it compare to using limma+voom with array weights? Both seem to involve around dampening the effect of outliers, but seem to have a different focus, with limma+voom weighting samples vs and edgeR robust weighting features, respectively.

ADD REPLYlink written 20 months ago by maltethodberg40
1

Array weights are focused on dealing with outlier samples, whereas the various robustness functions in edgeR deal with outlier observations in each feature, either by downweighting the observations or the entire feature. Depends on what you want to do; you use information across all features to estimate the array weight for each sample, so it's pretty reliable, but they won't provide protection when the outlier observations are stochastically distributed across many samples.

ADD REPLYlink modified 20 months ago • written 20 months ago by Aaron Lun17k
4
gravatar for Gordon Smyth
20 months ago by
Gordon Smyth32k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth32k wrote:

Fair question. We do offer a lot of alternative pipelines and it could be confusing.

Aaron has already covered most things, but to understand what robust=TRUE does in estimateDisp(), glmQLFit() and eBayes() see our recently accepted manuscript:

  http://arxiv.org/abs/1602.08678

The option has the same meaning in all three functions.

Here is quick overview of the different robustness concepts:

  1. One or more samples are outliers: use arrayWeights() or voomWithQualityWeights() in limma.
  2. One or more genes are more variable than the others: use robust=TRUE in the Bayes step in either edgeR or limma.
  3. There are outlier observations, but not associated with any particular sample or any particular gene: use lmFit(,method="robust") in limma or estimateGLMRobustDisp() in edgeR.

The third option is what statisticians usually think of as robustness, but the first two options are more efficient because they involve global models. Option 1 assumes that outlier observations tend to occur for particular samples. Option 2 assumes that outlier observations tend to be associated with particular genes or with a group of genes. This is more common than you might think.

ADD COMMENTlink modified 20 months ago • written 20 months ago by Gordon Smyth32k

Thank you for the response and link. I was not aware that lmFit had an option for robust regression. So in theory it would be possible to combine all 3 robustness concepts into one by running voomWithQualityWeights > lmFit(method="robust") > eBayes(robust=TRUE)?

ADD REPLYlink written 20 months ago by maltethodberg40
1

Possible but very, very unlikely to be sensible.

(1) and (2) consist of concise global models and work ok together but (3) involves estimating a robustness weight for every single observation and would be best used by itself.

ADD REPLYlink modified 20 months ago • written 20 months ago by Gordon Smyth32k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 299 users visited in the last hour