Question

Filter genes by expression for voomLmFit

annikagable • 0

Following up on limma::voom vs edgeR::voomLmFit - when to use?

I'm wondering:

With limma::voom, we always filtered out lowly expressed genes beforehand (typically using edgeR::filterByExpr), because voom did not work well for low-count genes and because filtering reduces the multiple-testing burden. However, edgeR::voomLmFit can handle low counts, so could we filter much less, e.g. remove only genes that have no counts at all?

I am aware that prefiltering reduces the multiple testing problem, and that the filtering is also supposed to remove technical noise, but probably we should still aim to filter as little as possible?

DEG edgeR limma voom filterByExpr • 537 views

updated 4 months ago by Gordon Smyth 50k • written 4 months ago by annikagable • 0

James W. MacDonald 65k

voomLmFit is meant to be able to correctly specify the degrees of freedom when you have lots of zeros (as well as iterating if you use duplicateCorrelation or sample weights). It doesn't 'fix' the excessive variation of low read count data, so you should still exclude genes that are measured at a level where noise is likely to predominate over signal.

written 4 months ago by James W. MacDonald 65k

score 3 · Accepted Answer · 2023-12-06

There are several reasons why low-count genes are filtered:

1. The dispersion of low-count genes is hard to estimate, and misestimation may distort the global mean-variance relationship.
2. Very low expression genes have little biological significance.
3. Genes with very few counts cannot achieve statistical significance even if differential expression is present. If kept in the analysis, they increase the amount of multiple testing and decrease statistical power without any compensating advantage.
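
As a rough illustration of the third point, the sketch below computes the smallest two-sided p-value a gene could reach under a deliberately simplified model: two groups with equal numbers of samples and equal library sizes, no biological variation, and an exact binomial test conditional on the gene's total count. This is not what limma or edgeR compute (the function names here are hypothetical), and real data with overdispersion would generally give weaker evidence, but it shows why a total of one or two reads can never be significant.

```python
def min_two_sided_p(total_count: int) -> float:
    """Smallest attainable two-sided p-value for a gene with this total count.

    Assumption (simplified model, not the limma/edgeR test): under the null,
    each read falls in either group with probability 0.5, so the most extreme
    outcome is all reads landing in one group.
    """
    return min(1.0, 2 * 0.5 ** total_count)


def smallest_total_count(alpha: float) -> int:
    """Smallest total count whose best-case p-value is <= alpha."""
    k = 0
    while min_two_sided_p(k) > alpha:
        k += 1
    return k


for k in (1, 2, 5, 6):
    print(k, min_two_sided_p(k))

# Best case needed to pass an unadjusted 0.05 threshold
print(smallest_total_count(0.05))
# Best case needed under a Bonferroni correction for 20000 genes
print(smallest_total_count(0.05 / 20000))
```

Even in this best case, a gene with two reads in total cannot get below p = 0.5, and the required total count rises once any multiple-testing adjustment is applied.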

Using voomLmFit, the first point is largely mitigated but not completely removed. Estimating variances when the total read count is just 1 or 2 is still essentially impossible. The second and third points still apply, so filtering with filterByExpr is still recommended. filterByExpr is designed to keep only those genes that have some chance of achieving statistical significance when meaningful DE is present.
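
To make the filtering rule concrete, here is a simplified sketch of the kind of criterion described in the filterByExpr help page: a gene is kept if its CPM reaches a cutoff, derived from a minimum count and the median library size, in at least as many samples as the smallest group, and if its total count is large enough. The function `keep_gene` is hypothetical, the defaults `min_count=10` and `min_total_count=15` are assumed to match edgeR's defaults, and the adjustment edgeR makes for large groups is left out. In a real analysis, call filterByExpr itself.

```python
from statistics import median


def keep_gene(counts, lib_sizes, min_group_size,
              min_count=10, min_total_count=15):
    """Simplified filterByExpr-style rule for one gene (illustrative only).

    counts         : read counts for the gene, one per sample
    lib_sizes      : library sizes, one per sample
    min_group_size : number of samples in the smallest group
    """
    # CPM value that corresponds to min_count reads in a median-sized library
    cpm_cutoff = min_count / median(lib_sizes) * 1e6
    cpms = [c / n * 1e6 for c, n in zip(counts, lib_sizes)]
    n_expressed = sum(cpm >= cpm_cutoff for cpm in cpms)
    return n_expressed >= min_group_size and sum(counts) >= min_total_count


lib_sizes = [10_000_000] * 6  # six samples, two groups of three

print(keep_gene([1, 0, 1, 0, 0, 0], lib_sizes, min_group_size=3))     # two reads in total
print(keep_gene([12, 15, 11, 0, 1, 0], lib_sizes, min_group_size=3))  # expressed in one group
```

The second gene is kept even though it is essentially absent from one group, which is the situation the rule is meant to protect: a gene only needs to be expressed in enough samples to fill the smallest group.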

In summary, there is no purpose in retaining genes with just one or two counts. They cannot ever be significantly DE, so they just decrease the power of the analysis for no gain.