Question

What benchmark should I use for setting the edgeR filterByExpr min.count parameter?

1

Jack S. ▴ 60

@9aa6de71

United States

This question is related to a previous query regarding pseudobulking in a single-cell dataset.

Question 1

When performing differential expression testing, how can I determine if my selections for filterByExpr min.count and min.total.count are appropriate?

From what I understand from discussions about filterByExpr, the default min.count and min.total.count values are generally recommended for most scenarios. However, there is an example in Section 4.10.3 of the edgeR User's Guide where these values are explicitly set in the function call:

keep.genes <- filterByExpr(y, group = y$samples$cluster, 
                           min.count = 10, min.total.count = 20)

Assuming I'm using glmQLFit and glmQLFTest as demonstrated in the example in Section 4.10.3, how can I determine if these values are too high or too low? Additionally, when would you suggest modifying the default values?

Question 2

The edgeR manual mentions that "glmQLFit gives special attention to handling of small counts and zero counts."

Does this imply that glmQLFit and glmQLFTest can be used without filtering out low counts? Would you recommend this approach?

Thanks in advance!

edgeR DifferentialExpression • 3.6k views

ADD COMMENT • link updated 14 months ago by Gordon Smyth 53k • written 22 months ago by Jack S. ▴ 60

score 3 · Answer 1 · 2024-03-21

3

Gordon Smyth 53k

@gordon-smyth

WEHI, Melbourne, Australia

The limma and edgeR User's Guides, case studies, documentation and online posts all consistently advise you to use filterByExpr with default settings. See for example

filterByExpr appears many times in the edgeR User's Guide, all but once with default settings. In the one example that you quote, min.total.count was increased to 20 because there were clusters with only 1 sample, so the author wanted to make sure there were at least 20 reads for each gene. To be honest, I don't think that was necessary and the argument could have been left at the default.

You are worrying about this unnecessarily. Sure, you can reduce the amount of filtering without adversely affecting an analysis. But why bother? You will generally just end up including more non-DE genes in the analysis and perhaps reducing power a little bit.

The filterByExpr settings were chosen with small experiments in mind. The only time I would think about changing the parameter would be for experiments with a large number of samples. In that case, I might consider reducing min.count and min.prop. To judge whether the settings are appropriate, you simply look at the standard edgeR variance plots from plotBCV and plotSA.
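For readers who want to see what that check looks like in practice, here is a minimal sketch on simulated counts. The simulated data, the two-group design and all object names are assumptions made up for illustration. plotQLDisp is used here as the edgeR quasi-likelihood counterpart of the plotSA view, which is my choice for the sketch rather than something prescribed above.

```r
library(edgeR)

set.seed(1)
ngenes   <- 2000
group    <- factor(rep(c("A", "B"), each = 3))
nsamples <- length(group)

# Simulated negative binomial counts covering a wide range of expression levels
# (illustrative data only, no true differential expression)
mu     <- rep(2^runif(ngenes, min = -2, max = 10), times = nsamples)
counts <- matrix(rnbinom(ngenes * nsamples, mu = mu, size = 10), nrow = ngenes)
rownames(counts) <- paste0("gene", seq_len(ngenes))

y      <- DGEList(counts = counts, group = group)
design <- model.matrix(~ group)

# Filter with the default settings, then recompute library sizes
keep <- filterByExpr(y, design = design)
y    <- y[keep, , keep.lib.sizes = FALSE]
y    <- calcNormFactors(y)

# Fit dispersions and draw the two diagnostic plots
y   <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
plotBCV(y)       # BCV against average log-CPM
plotQLDisp(fit)  # quarter-root QL dispersion against average log-CPM
```

If the trend in these plots is smooth across the retained expression range, the filtering is probably adequate; a sharp kink at the low-abundance end would suggest that too many low-count genes were kept.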

ADD COMMENT • link 22 months ago Gordon Smyth 53k

1

Hi Gordon, I had never thought about changing the defaults of filterByExpr() with a large number of samples. You say that plotBCV() and plotSA() can be used to judge whether non-default values for min.count and min.prop are appropriate, but (and I hope I'm not asking something too trivial) how exactly? I've played a bit with lowering min.count from 10 to 7 and min.prop from 0.7 to 0.5 in a dataset with more than 400 samples, and I see slightly larger BCV values and slightly smaller prior DFs, but how do I know when I'm setting better non-default values?

ADD REPLY • link 22 months ago Robert Castelo ★ 3.4k

2

It is the minimum group size rather than the total number of samples that is relevant.

min.prop is more a biological parameter than a statistical one. If you find a gene up-regulated in group B vs A, would it still be a biologically meaningful result to you if the gene was only detected in 50% of the samples in group B?
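To make the role of the minimum group size and min.prop concrete, here is a small base-R sketch of the rule as I read it from the filterByExpr documentation, for a simple one-factor design. The helpers min_sample_size and cpm_cutoff are hypothetical names written for this illustration and are not part of edgeR; filterByExpr itself works from the design matrix more generally.

```r
# Hypothetical helper (not an edgeR function): the number of samples in which
# a gene must reach the CPM cutoff, assuming a simple one-factor design.
min_sample_size <- function(group, large.n = 10, min.prop = 0.7) {
  n <- as.numeric(min(table(droplevels(factor(group)))))
  # For groups larger than large.n the requirement is relaxed using min.prop
  if (n > large.n) n <- large.n + (n - large.n) * min.prop
  n
}

# Hypothetical helper: the CPM cutoff implied by min.count
cpm_cutoff <- function(lib.size, min.count = 10) {
  min.count / median(lib.size) * 1e6
}

small <- rep(c("A", "B"), each = 3)
large <- rep(c("A", "B"), each = 200)

min_sample_size(small)                  # 3: every sample in the smallest group
min_sample_size(large)                  # about 143 with the default min.prop = 0.7
min_sample_size(large, min.prop = 0.5)  # 105
cpm_cutoff(c(1e6, 2e6, 4e6))            # about 5 CPM for a median library size of 2e6
```

Under this reading, adding samples to an already large group only changes the requirement through min.prop, which is why the smallest group, not the total sample count, drives the filter.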

You could reduce min.count if you have large group sample sizes and you want to detect DE genes at low expression levels. If the plotSA trend is still smooth and monotonic, then lower values are fine.

I assume you're using edgeR quasi. The new edgeR 4.0 quasi method with legacy=FALSE is designed to improve edgeR's performance for small counts and large sample sizes. With the new method, you can reduce min.count to very small values without harming edgeR's statistical performance. Of course, that's only useful to you if you are detecting DE genes at low expression levels.
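As a rough sketch of what that might look like, assuming edgeR 4.0 or later: the simulated data, the group sizes and the value min.count = 3 are illustrative assumptions only, not recommendations.

```r
library(edgeR)  # assumes edgeR >= 4.0 so that glmQLFit accepts legacy = FALSE

set.seed(2)
ngenes   <- 2000
group    <- factor(rep(c("A", "B"), each = 20))
nsamples <- length(group)

# Simulated counts with many lowly expressed genes (illustrative data only)
mu     <- rep(2^runif(ngenes, min = -3, max = 8), times = nsamples)
counts <- matrix(rnbinom(ngenes * nsamples, mu = mu, size = 10), nrow = ngenes)

y      <- DGEList(counts = counts, group = group)
design <- model.matrix(~ group)

# Default filtering versus a relaxed min.count (3 is an arbitrary example value)
keep_default <- filterByExpr(y, design = design)
keep_relaxed <- filterByExpr(y, design = design, min.count = 3)

y2  <- y[keep_relaxed, , keep.lib.sizes = FALSE]
y2  <- calcNormFactors(y2)
y2  <- estimateDisp(y2, design)
fit <- glmQLFit(y2, design, legacy = FALSE)
res <- glmQLFTest(fit, coef = 2)

c(default = sum(keep_default), relaxed = sum(keep_relaxed))
plotQLDisp(fit)  # check that the trend stays smooth at low abundance
topTags(res)
```

The relaxed filter keeps at least as many genes as the default one, so the number of tests goes up; whether that is worthwhile depends on whether DE genes at low expression levels are of interest.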

ADD REPLY • link 22 months ago Gordon Smyth 53k

0

Hi Gordon, is it advisable to use glmQLFit with legacy=FALSE and reduce min.count for DEG analysis with single-cell data, i.e. pseudobulk (the sum of counts for each cell type per sample), as the number of counts can be as low as 3 or 4?

ADD REPLY • link 14 months ago Maziya • 0

1

Pseudo-bulk is not at the single-cell level; it behaves similarly to bulk RNA-seq except that the library sizes might be quite low. It does not raise any new issues not already discussed above. There is still no point in keeping genes with very low counts unless the number of samples is large, because such a gene cannot be significantly DE. You can try reducing min.count if you want. There is little harm in it with edgeR 4, except that the amount of multiple testing is somewhat increased.

If you have further questions, please start a new question rather than adding comments to a 7-month-old question.

ADD REPLY • link 14 months ago Gordon Smyth 53k