Question

Filter genes by expression for voomLmFit

0

annikagable • 0

@78889942

Switzerland

Following up on limma::voom vs edgeR::voomLmFit - when to use?

I'm wondering:

With limma::voom, we always filtered out lowly expressed genes beforehand (typically using edgeR::filterByExpr), because voom did not work well for low-count genes and because filtering reduces the multiple-testing burden. However, since edgeR::voomLmFit can handle low counts, could we filter much less, e.g. remove only genes that have no counts at all?

I am aware that prefiltering reduces the multiple testing problem, and that the filtering is also supposed to remove technical noise, but probably we should still aim to filter as little as possible?

DEG edgeR limma voom filterByExpr • 1.6k views

updated 2.1 years ago by Gordon Smyth • written 2.1 years ago by annikagable

2

James W. MacDonald 68k

@james-w-macdonald-5106

United States

voomLmFit is meant to be able to correctly specify the degrees of freedom when you have lots of zeros (as well as iterating if you use duplicateCorrelation or sample weights). It doesn't 'fix' the excessive variation of low read count data, so you should still exclude genes that are measured at a level where noise is likely to predominate over signal.

written 2.1 years ago by James W. MacDonald

score 3 · Accepted Answer · 2023-12-06

There are several reasons why low-count genes are filtered:

1. The dispersion of low-count genes is hard to estimate, and misestimation may distort the global mean-variance relationship.
2. Very low expression genes have little biological significance.
3. Genes with very few counts cannot achieve statistical significance even if differential expression is present. If kept in the analysis, they increase the amount of multiple testing and decrease statistical power without any compensating advantage.

Using voomLmFit, the first point is largely mitigated but not completely removed. Estimating variances when the total read count is just 1 or 2 is still essentially impossible. The second and third points still apply, so filtering with filterByExpr is still recommended. filterByExpr is designed to keep only those genes that have some chance of achieving statistical significance when meaningful DE is present.
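The recommendation above can be sketched as a short workflow: filter with filterByExpr first, then pass the filtered counts to voomLmFit. This is only a minimal illustration, not a prescribed pipeline. The simulated counts, the two-group design, and the object names (`counts`, `group`, `y`, `keep`) are assumptions made for the example, and filterByExpr is run with its default thresholds.

```r
library(edgeR)  # also attaches limma

set.seed(1)

# Hypothetical data: 1000 genes x 6 samples, two groups of 3,
# with expected counts spanning very low to high expression.
ngenes <- 1000
group  <- factor(rep(c("A", "B"), each = 3))
mu     <- rep(c(0.5, 5, 50, 500), each = ngenes / 4)
counts <- matrix(rnbinom(ngenes * 6, mu = mu, size = 10), nrow = ngenes)
rownames(counts) <- paste0("gene", seq_len(ngenes))

y      <- DGEList(counts = counts, group = group)
design <- model.matrix(~ group)

# Keep only genes with enough counts to be worth testing
# (default filterByExpr settings, informed by the design matrix).
keep <- filterByExpr(y, design)
table(keep)

# Drop the filtered genes and recompute library sizes.
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)

# voomLmFit combines the voom transformation and the linear model fit.
fit <- voomLmFit(y, design)
fit <- eBayes(fit)
topTable(fit, coef = 2)
```

In this simulation the genes with the lowest expected counts should be the ones removed by the filter, while the moderately and highly expressed genes go on to the fit.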

In summary, there is no purpose in retaining genes with just one or two counts. They cannot ever be significantly DE, so they just decrease the power of the analysis for no gain.
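A rough way to see why such genes can never reach significance is to ask for the smallest p-value that is possible at a given total count. The sketch below is a deliberate simplification and not what limma or edgeR computes: it assumes two groups with equal library sizes and Poisson-like counts, so that, conditional on the gene's total count, the reads split between the groups binomially with probability 0.5. The helper `min_p` is a hypothetical name introduced for the illustration.

```r
# Smallest two-sided p-value when all `total` reads of a gene
# fall into one group (the most extreme possible outcome).
# Assumption: equal library sizes, so the null split probability is 0.5.
min_p <- function(total) {
  binom.test(x = total, n = total, p = 0.5)$p.value
}

totals <- c(1, 2, 5, 10, 20)
data.frame(total = totals, min_p = sapply(totals, min_p))
```

Under these assumptions a gene with a total count of 2 cannot do better than p = 0.5, and a total count of 1 always gives p = 1, before any multiple-testing adjustment is applied.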