Question

How to determine which edgeR glmFit and glmQLFit are better

1

Entering edit mode

changjiang.h.xu ▴ 10

@changjianghxu-9498

Last seen 8.3 years ago

Hi,

I would determine which of the functions glmFit (glmLRT) and glmQLFit (glmQLFTest) should be used for my RNAseq data using function gof, as follows:

f1 <- glmFit(y, design); g1 <- gof(f1, plot = T)

f2 <- glmQLFit(y, design); g2 <- gof(f2, plot = T)

And then compare the QQ-plots of g1 and g2. My questions are

1. The above is right or not? Should the two functions work on the same y, where y is DGEList object with dispersion estimates?

2. Will the dispersion parameters are re-estimated by glmQLFit? Should the dispersion estimates be different for glmFit and glmQLFit?

3. How to know which is better by comparing the QQ-plots of g1 and g2?

Thanks.

Changjiang

edger • 9.5k views

ADD COMMENT • link updated 8.3 years ago by Gordon Smyth 50k • written 8.3 years ago by changjiang.h.xu ▴ 10

score 6 · Answer 1 · 2016-01-11

6

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 8 hours ago

The city by the bay

gof assumes that the total residual deviance of the fitted GLM follows a chi-squared distribution on the residual degrees of freedom. This is not the case in the quasi-likelihood framework where the deviance is scaled by a gene-specific QL dispersion. Thus, it doesn't seem appropriate to apply gof on f2 - at the very least, you shouldn't compare gof results between f1 and f2, because they aren't really comparable. In general, I would predict that the QL fit will look "worse", but that's because gof doesn't account for presence of the QL dispersion in the model.
glmQLFit uses the trended NB dispersion to fit a GLM, and then estimates the QL dispersion from the deviance. It doesn't re-estimate the NB dispersion but instead estimates an entirely different QL dispersion for each gene. In contrast, glmFit uses the tagwise NB dispersion to fit a GLM - and that's it.
You probably won't be able to tell from the gof results. What I can tell you is that glmQLFit will provide more accurate type I error rate control as it accounts for the uncertainty of the dispersion estimates in a more rigorous manner than the glmFit pipeline. As such, I use it routinely for my analyses.

ADD COMMENT • link 8.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks Aaron. I see gof only works for glmFit. How can I do goodness of fitting test for glmQLFit?

Changjiang

ADD REPLY • link 8.3 years ago changjiang.h.xu ▴ 10

1

Entering edit mode

Well, I'm not sure that it makes all that much sense to do so. Any irregularity in the initial GLM fit will manifest as a change in the distribution of the deviances, but such changes will end up being modelled by glmQLFit via estimation of f2$var.prior or f2$df.prior. Any (deviance-based) goodness-of-fit testing that includes the QL dispersion would effectively cancel itself out, because the purpose of the QL dispersion is to model any "non-goodness of fit", so to speak.

ADD REPLY • link 8.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Aaron is right. You can't do a goodness of fit plot for a QL model fit -- the whole concept of goodness of fit doesn't apply because QL automatically fits dispersions at the gene level.

ADD REPLY • link 8.3 years ago Gordon Smyth 50k

score 3 · Answer 2 · 2016-01-11

3

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

The goodness of fit plot done by gof() is actually only useful for evaluating models glm models with constant or trended dispersion. The plot is designed to evaluate whether these global dispersion models are adequate --- if not then tagwise dispersions are required. The plot doesn't make sense for a glm tagwise dispersion model and it isn't even defined for a QL model.

So the gof() plot can only be used to decide whether a global dispersion model is adequate. It can't be used to choose between glm-tagwise vs a QL model. By definition, both of these models are adequate from a gof() point of view.

As Aaron has already explained, if you are seriously worried about assumptions, then QL is an obvious choice. It is based on a more conservative model and makes fewer assumptions than does glmLRT().

I have to admit that the gof() help page doesn't explain this properly.

ADD COMMENT • link 8.3 years ago Gordon Smyth 50k

0

Entering edit mode

Thank you very much for the details. glmQLFTest seems too conservative for my data because the BH FDR values all equal to 1. Then I would know whether or not glmLRT may be used instead of glmQLFTest.

Changjiang

ADD REPLY • link 8.3 years ago changjiang.h.xu ▴ 10

0

Entering edit mode

glmLRT is a very well regarded method. It can be a little liberal in some circumstances, but only slightly. Why would you not be able to use it?

ADD REPLY • link 8.3 years ago Gordon Smyth 50k

0

Entering edit mode

It is no problem to run both glmLRT and glmQLFTest. Just want to know how to decide which is better or more suitable for my data. Thanks.

Changjiang

ADD REPLY • link 8.3 years ago changjiang.h.xu ▴ 10

0

Entering edit mode

Well, I think that we've already explained how to decide. If you want to be conservative and safe, use QL. If you want to be a bit more aggressive and find all the DE genes you can, still with good FDR control, then use glmLRT. You have already said that you prefer the latter, so use glmLRT.