Ok, first, a quick summary of the available tests: edgeR and DESeq2 both offer a likelihood ratio test based on a negative binomial generalized linear model. DESeq2 also offers a Wald test, while edgeR also offers the quasi-likelihood F-test. limma offers ordinary linear model t-tests and F-tests. All of them use empirical Bayes methods to squeeze dispersion/variance estimates toward the mean of the dataset. (edgeR and DESeq 1 also offer some older tests, but I believe they're effectively deprecated now, and I can't think of a reason you would choose them over the ones mentioned above.)

Second, let's look at the theoretical differences between the tests: The edgeR and DESeq2 LRTs will not give identical results because they use slightly different methods to estimate the dispersions. However, if you force both of them to use the same dispersions, they should give you identical results. I believe the Wald test and LRT are asymptotically equivalent, although experiments with few samples might not be close to the asymptotic behavior. In addition, DESeq2 shrinks the logFC values toward zero when using the Wald test, but not the LRT. I believe the logFC values and standard errors are generally shrunk by a similar amount, so this shrinkage doesn't have much of an effect on the p-values. The LRT & Wald test are generally more liberal than the QL test and limma, because they require estimating the dispersions in a separate step and then fitting the GLM assuming the dispersions are estimated perfectly, ignoring the estimation uncertainty.

In practice, I've found that both LRTs and DESeq2's Wald test give quite similar results to each other on the same test, while edgeR's quasi-likelihood F-test and limma's t/F-tests also give quite similar results to each other. You will usually get more genes with the LRT (and Wald), but their liberalness also causes them to sometimes give more false positives than the nominal FDR would imply (e.g. giving many significant genes on a null comparison). The QL test and limma, in contrast, are more conservative (fewer significant genes), but tend to maintain proper false positive control in most cases. In theory, limma's use of the normal distribution as an approximation breaks down at low counts, and anecdotally this seems to be the case. The deeper your sequencing is and the more samples you have, the better limma will be. limma will also run much more quickly than the others on large data sets due to the use of a linear model instead of a GLM. Lastly, limma can fit some experimental designs that cannot be fit by edgeR or DESeq2, specifically models with unequal group/sample variances (beyond the inherent differences in variance due to differences in abundance) and certain kinds of mixed models. However, this probably isn't relevant for your purposes since you seem to already have a design that works with all 3.

Anyway, to answer your question, DESeq2, edgeR and limma all give pretty similar results, though there are some cases when one is better or worse than the others. I think you would be better off choosing the most appropriate one for your experiment rather than trying to average or overlap their results. My general advice is, for lots of well-sequenced samples, use limma and enjoy the speed. For few samples and/or low sequencing depth, use DESeq2 or edgeR. My personal preference is to use the QL test rather than the LRT or Wald tests because I usually prioritize strict false positive control over a larger gene list, but your priorities may be different. In any case, the differences between methods are minor compared to the importance of using exploratory data analysis to ensure you are fitting the model that best describes your data.

However, I often find it useful to run more than one tool simply for the diagnostics that they produce. For example, running edgeR or DESeq2 to view the BCV/dispersion plots even though I'm using limma, or running limma's voomWithQualityWeights to see if it suggests any outlier samples even though I plan to use edgeR.

(To any of the package authors: This is all from memory. Feel free to correct me on any details I got wrong.)

One small note on the Negative Binomial methods and "unequal group variances": groups with different expected count will of course have different within-groupĀ

variancesaccording to the NB model, it's the dispersion (~square of the coefficient of variation) that is fixed for a given gene. I know this is what you meant, but it sometimes confuses people who are reading about the NB models for the first time.You're right, I'll edit my post.

Great elaborate answer, thanks a lot for the explanation and your opinion!