Search
Question: edgeR glmLRT with test="F"?
2
3.0 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson6.8k wrote:

In the past few releases, edgeR's glmLRT has supported an alternative significance test via the test="F" argument. However, I don't really see any documentation of the method used for this test and how it differs from the normal chi-squared likelihood ratio test. Is this alternative test recommended for normal use, and if so, when would you want to use it instead of the chi-squared or quasi-likelihood tests?

modified 3.0 years ago by Aaron Lun19k • written 3.0 years ago by Ryan C. Thompson6.8k
5
3.0 years ago by
Aaron Lun19k
Cambridge, United Kingdom
Aaron Lun19k wrote:

The idea with test="F" was to account for the uncertainty in the estimates of the tagwise NB dispersion for each gene, in order to improve the performance of the GLM framework in the presence of variable dispersions. However, the math didn't pop out nicely because of the difficulties of working with adjusted profile likelihoods during NB dispersion estimation. If I remember correctly, it ended up being conservative during testing.

We think that, if variability and uncertainty in estimation needs to be considered, this would be better achieved with the quasi-likelihood framework. Indeed, I use glmQLFit and glmQLFTest for most of my analyses (RNA-seq, ChIP-seq and Hi-C), only using the chi-squared test in glmLRT when I don't have any replicates.

1

Indeed, the quasi-likelihood framework, i.e., glmQLFit and glmQLFTest, is preferred over the glmLRT in general. In some rare occasions where there are no replicates, one might have to fit GLMs using some prior estimates of NB dispersions and call glmLRT for the DE analysis.

The glmLRT with test="F" was implemented to account for the uncertainty of the NB dispersion estimates in the LRT framework. Now considering the limited scenarios where it is applicable, we decided to phase out the F-test of the glmLRT in the next release to avoid confusion.

I just noticed something that you guys might feel interesting, although I am not sure if it is true: if you have done ERCC normalization, even with replicates, LRT might be a better choice.

The reason I am saying this is that I have ERCC normalized samples, fed into RSEM, and then to edgeR. The p-values obtained through LRT are in general much smaller (e.g. e-300) than the QLF p-values (e.g. e-20), and the top genes found through LRT seem to make more sense (i.e. identified and validated previously with biological experiments). It might be that the ERCC normalization minimized between library variations of gene expression levels, and that makes the LRT a better estimate?

1

I am reluctant to gauge the suitability of a computational method based on anecdotal evidence. Getting lower p-values provides no indication that a method is doing better or worse, especially if you don't know whether type I error is controlled. Indeed, getting lower p-values from LRT almost definitely reflects the fact that it is liberal. Also, the phrase "seem to make more sense" is too vague and subjective to make concrete decisions from a quantitative perspective.

In your specific case, ERCC normalization (for bulk RNA-seq data, presumably?) can yield odd behaviour prior to DE analyses, see https://dx.doi.org/10.1038/nbt.2931 for details.

Read the paper you mentioned before. ERCC is meant to reduce the variance generated during PCR amplification and library preparation, so it should have its merit. I am not saying we should use ERCC to replace other normalization methods like the one used in edgeR. Our experience with ERCC normalization is usually reasonably good as shown in the RLE plots and a couple of other plots. Not sure how the authors of the paper chose their ERCC samples. Also, the paper admitted RUVg(ERCC) is as good as RUVg (which could also use ERCC) -- not sure about the difference between the two, though. My question can be simplified as: if we do ERCC normalization followed by edgeR, whether LRT is more appropriate than GLF. I have not seen such comparisons yet. Theoretically, why GLF is preferred in general? Would ERCC normalization change such preference?

Thanks.

1

The problem with ERCCs is that it is not clear how much should be added to each sample. One should obviously add "the same amount of spike-in RNA" to each sample, but this has many different definitions in bulk RNA-seq. Should we add the the same molar quantity of spike-ins per tube? Per microgram of endogenous RNA? Per millions of cells in the sample? This ambiguity is part of the reason why spike-ins are difficult to use in practice, despite being a nice idea in theory. This is reflected in Figures 4d and 5 of the paper mentioned previously, where the inter-sample log-fold changes for the spike-ins are not similar to the log-fold change in the majority of genes.

Anyway, let's get back to your question. The QL F-test is better than the LRT as it provides better type I error control, by accounting for the uncertainty of the dispersion estimates. (There should be a simulation somewhere in the Lund et al. paper.) ERCC normalization does not change this, provided that the ERCC normalization factors are correct. If they are incorrect, all bets are off - you should not hope that the error in the LRT will cancel out the error in the normalization.

P.S. I should add that Risso et al. clearly state that spike-ins cannot be reliably used for scaling normalization. Running RUVg on the spike-ins is a completely different matter, and not something that edgeR does by default.

Thanks much for the reply. How about the type II error of QLF vs LRT? Also, why replicates have anything to do with the choice of QLF/LRT? Does QLF take into account of within group variation (as revealed by the replicates) while LRT does not?

1

Type II error rates cannot be directly compared as the LRT does not hold its size. I don't recall seeing or making ROC plots, though I would be surprised if there was a major difference in power.

As for the replicates - they have everything to do with the choice between the QL F-test and the LRT, because the major difference between the two approaches lies in how they model variability between replicates from each group. The Lund et al. 2012 paper is a good place to start if you want to know more.

The edgeR user's guide states:

"While the likelihood ratio test is a more obvious choice for inferences with GLMs, the QL F-test is preferred as it reflects the uncertainty in estimating the dispersion for each gene. It provides more robust and reliable error rate control when the number of replicates is small."

What does it mean by "likelihood ratio test is a more obvious choice for inferences with GLMs"? It seems it's good for something (but not RNA-seq). How many replicates are considered to be small? If QLF is preferred when the number of replicates is small, then why when the number of replicates is the smallest (i.e. 1) one can use LRT instead? What to use when the number of replicates is big? For one RNA-seq project, I have 4 biological replicates for each condition, and I have total 32 samples with 2 cell types and 4 compound concentrations (including 0). Is the number of replicates considered to be small here?

Sorry for the nuance questions. Downloaded the paper you mentioned above but it's too statistical for me, I guess.

Thanks a lot and hope these are my last questions. :)

1

The LRT is typically used for GLMs that don't need to estimate a dispersion parameter, e.g., logistic regression, Poisson models. However, for negative binomial models, the dispersion parameter needs to be estimated from the same data used to fit the model.This introduces extra uncertainty into the model fit that is not handled by the LRT. This motivates the use of the QL F-test instead.

For data sets with many replicates, the LRT might be usable as the uncertainty of dispersion estimation could be ignored. However, what statisticians consider to be a "large" number of replicates (>50 residual d.f., perhaps?) is usually not achievable for most genomics studies. The QL F-test is good to use (i.e., holds its size) at all levels of meaningful replication, which is why we recommend it.

For situations involving only one sample in each group, the QL F-test cannot be used. The LRT can be used, but not in any reliable sense; you give it a more-or-less arbitrary dispersion value, and I would not trust the results coming out of it. The message from the user's guide for such cases is that there is little rigorous statistical analysis that can be done when the experimental design is rubbish.