EdgeR glmLRT vs glmQLFTest
2
0
Entering edit mode
Marianna • 0
@7cc5052f
Last seen 18 days ago
Italy

Hi everybody,

afer reading some posts about different DE analysis provided by edgeR, I found that the QL framework it's a the better choice, as it provides more accurate type I error rate control. However, when the dispersions are very large and the counts are very small it's better to switch to the LRT framework. But what do you mean for "very large" and "very small"? Are there reasonable thresholds?

For instance, I'm dealing with a dataset of 12 samples (1 variable/4 factors), and I'm not sure about the best approach to be adopted. Based on plotMDS, plotBVC, and plotQLDisp posted above, what's your feeling?

edgeR DifferentialExpression • 190 views
0
Entering edit mode
@gordon-smyth
Last seen 8 hours ago
WEHI, Melbourne, Australia

In my opinion, you should simply be using the quasi-likelihood pipeline. For a regular bulk RNA-seq experiment with 12 samples you should virtually always be using glmQLFit and glmQLFTest.

when the dispersions are very large and the counts are very small it's better to switch to the LRT framework.

The edgeR documentation does not recommend this. You seem to be quoting Aaron Lun from here ( EDGE-R exact test vs QL F-test ). I will let Aaron Lun respond -- he might have had a single cell or ChIP-seq context in mind.

0
Entering edit mode

he might have had a single cell or ChIP-seq context in mind.

Yes, single-cell. I was fiddling with Smart-seq2 read count data - dispersions routinely greater than 1, average counts less than 1. IIRC the QL tests were not holding their size on simulated data; I think I traced the cause to a deterioration in the saddlepoint approximation such that the deviances were not chi-squared distributed. A toy demonstration is provided below:

library(edgeR)
y <- matrix(rnbinom(5000000, mu=0.5, size=0.5), ncol=50)

y <- DGEList(y)
design <- model.matrix(~gl(2, 25))
fit <- glmQLFit(y, design, dispersion=2) # using the true NB dispersion.
res <- glmQLFTest(fit, coef=2)

mean(res$table$PValue <= 0.01)
## [1] 0.03035
mean(res$table$PValue <= 0.05)
## [1] 0.09861


The issue disappears at larger means and/or smaller dispersions, so is largely irrelevant to discussions of bulk RNA-seq data analysis.

At the time, I ended up using the LRT, but further contemplation led me to realize that there were more fundamental problems with treating cells as replicates of each other. Hence the use of pseudo-bulk samples, which (i) captures the right level of replication without the need to use complex mixed models and (ii) increases the counts and decreases the dispersions such that QL methods can be safely applied (e.g., scran::pseudoBulkDGE).

For ChIP-seq, the QL method can be used without much concern. The counts are usually high enough for regions of interest (usually enriched peaks) and the dispersions in the same ballpark as bulk RNA-seq.

0
Entering edit mode

For the benefit of other readers, let me clarify that all the data rows in the above simulation would be filtered out by filterByExpr and so would not appear in a bulk RNA-seq analysis.

0
Entering edit mode
Marianna • 0
@7cc5052f
Last seen 18 days ago
Italy

Hi all,

thank you very much for your replies.

It seems for bulk RNA-seq analysis the QL approach should be always preferred over LRT. Thus, based also on the manual, likelihood ratio test is an option in single cell RNA-seq analysis and when no replicates are available.

Marianna