Filter genes by expression for voomLmFit

Following up on limma::voom vs edgeR::voomLmFit - when to use?

I'm wondering:

With limma::voom, we always filtered out lowly expressed genes beforehand (typically using edgeR::filterByExpr), because voom did not work well for low-count genes (and also because filtering reduces the multiple-testing problem). However, with edgeR::voomLmFit, which can handle low counts, could we filter much less aggressively, e.g. remove only genes that have no counts at all?

I am aware that prefiltering reduces the multiple testing problem, and that the filtering is also supposed to remove technical noise, but probably we should still aim to filter as little as possible?

DEG edgeR limma voom filterByExpr
WEHI, Melbourne, Australia

There are several reasons why low-count genes are filtered:

  • The dispersion of low-count genes is hard to estimate, and misestimation may distort the global mean-variance relationship.
  • Very low expression genes have little biological significance.
  • Genes with very few counts cannot achieve statistical significance even if differential expression is present. If kept in the analysis they increase the amount of multiple testing and decrease statistical power without any compensating advantage.

Using voomLmFit, the first point is largely mitigated but not completely removed. Estimating variances when the total read count is just 1 or 2 is still essentially impossible. The second and third points still apply, so filtering with filterByExpr is still recommended. filterByExpr is designed to keep only those genes that have some chance of achieving statistical significance when meaningful DE is present.

In summary, there is no purpose in retaining genes with just one or two counts. They cannot ever be significantly DE, so they just decrease the power of the analysis for no gain.
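
For concreteness, here is a minimal sketch of the filter-then-fit workflow described above. The simulated counts, the two-group design, and the object names (`counts`, `group`, `design`, `keep`) are assumptions made up for illustration, and filterByExpr is run with its default settings, so treat this as a starting point rather than a recommended pipeline for any particular dataset.

```r
library(edgeR)   # attaches limma as well

set.seed(1)

# Simulated data (an assumption for illustration only): 1000 genes, 6 samples,
# with expected counts spanning a wide range so that many genes are near zero.
ngenes <- 1000
group  <- factor(rep(c("A", "B"), each = 3))
mu     <- rep(2^runif(ngenes, min = -3, max = 10), times = 6)
counts <- matrix(rnbinom(ngenes * 6, mu = mu, size = 10), nrow = ngenes)
rownames(counts) <- paste0("gene", seq_len(ngenes))
colnames(counts) <- paste0("s", 1:6)

design <- model.matrix(~ group)

y <- DGEList(counts = counts, group = group)

# Drop genes with too few counts to be worth testing, given the design
keep <- filterByExpr(y, design)
y <- y[keep, , keep.lib.sizes = FALSE]

# Recompute normalization factors on the filtered data
y <- calcNormFactors(y)

# Fit with voomLmFit on the filtered DGEList, then moderate and rank genes
fit <- voomLmFit(y, design)
fit <- eBayes(fit)
topTable(fit, coef = 2)
```

`table(keep)` shows how many genes were removed by the filter; only the retained genes enter the multiple-testing adjustment.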

United States

voomLmFit is meant to be able to correctly specify the degrees of freedom when you have lots of zeros (as well as iterating if you use duplicateCorrelation or sample weights). It doesn't 'fix' the excessive variation of low read count data, so you should still exclude genes that are measured at a level where noise is likely to predominate over signal.
