Suggestions for non-specific filtering criteria with limma
rvinisko ▴ 10
United States

Hello, I have a set of RNA-seq data from a clinical trial with 2 treatment groups (active, n~90 patients; placebo, n~35 patients), and each patient has 2 RNA samples (pre- and post-treatment). I plan to use limma with the voom transformation to detect differentially expressed genes (DEGs) between the two groups. Can someone provide guidance or suggestions on non-specific filtering criteria for removing lowly expressed genes, and the rationale behind a chosen cutoff? My understanding is that the data need to be normalized and that voom expects the lowly expressed genes to be removed before the weights are calculated. I've tried a few available tools and suggested methods, but would like help choosing a methodology for selecting the cutoff(s) for this design.

thanks, richard

limma genefilter • 1.9k views
WEHI, Melbourne, Australia

You are right that voom expects very low count genes to have been filtered.

There are a number of ways you can do this and most of them work satisfactorily. You could choose genes with a minimum total count or you could run aveLogCPM in the edgeR package and filter on that. Alternatively, here is a slightly more complex cpm method that I often use:

Your study has four groups of samples (active-pre, active-post, placebo-pre and placebo-post). Let's say a count of 5 is a minimum reasonable size. A gene would surely have to have a count of at least 5 in at least some samples before you would consider it interesting. More precisely, you would surely only be interested in genes with a count of 5 or more in at least half the samples of one of the groups. The smallest group in your study has 35 samples, and half of that, rounded up, is 18. So I would keep genes that achieve cpm > 5/M in at least 18 samples, where M is the median library size in millions.
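To make the arithmetic concrete, here is a minimal sketch of that rule in plain Python (standard library only). It is an illustration under stated assumptions, not edgeR or limma code: the function name `cpm_filter`, the toy counts, and the use of raw column sums as library sizes (with no normalization factors) are all assumptions made for this example.

```python
from statistics import median


def cpm_filter(counts, min_count=5, min_samples=18):
    """Return (cpm_cutoff, keep) for a dict of gene -> raw counts per sample.

    Hypothetical helper for illustration: library sizes are plain column
    sums, and a gene is kept when cpm > min_count / M in at least
    min_samples samples, where M is the median library size in millions.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size of each sample = total count in that column.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    m = median(lib_sizes) / 1e6          # median library size in millions
    cutoff = min_count / m               # CPM equivalent of a count of min_count
    keep = {}
    for gene, row in counts.items():
        cpms = [c * 1e6 / lib for c, lib in zip(row, lib_sizes)]
        n_pass = sum(cpm > cutoff for cpm in cpms)
        keep[gene] = n_pass >= min_samples
    return cutoff, keep


# Toy data: 4 samples, each with a library size of exactly 1 million reads.
toy_counts = {
    "filler":   [999_980, 999_997, 999_990, 999_999],
    "geneHigh": [20, 2, 10, 0],
    "geneLow":  [0, 1, 0, 1],
}

# With only 4 toy samples, require 2 passing samples instead of 18.
cutoff, keep = cpm_filter(toy_counts, min_count=5, min_samples=2)
print(cutoff, keep)
```

In the real study the call would use `min_samples=18`, and the counts would come from the full gene-by-sample matrix rather than a toy dict.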

svlachavas ▴ 810
Germany/Heidelberg/German Cancer Resear…

Dear Richard,

As you mentioned that your technology is RNA-seq, you could start by checking the limma user's guide on page 119, which describes a kind of filtering that can be performed on the number of "total counts".

Also, you can check the online course material from the Bioconductor repository, especially the Pre-filtering section, which states:

"we can remove the rows that have no or nearly no information about the amount of gene expression. Here we remove rows of the DESeqDataSet that have no counts, or only a single count across all samples:.."
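As a rough illustration of the rule quoted above (not the code from that course material), the same minimal filter can be sketched in plain Python. The function name `drop_near_empty_rows` and the toy counts are hypothetical, chosen only for this example.

```python
def drop_near_empty_rows(counts, min_total=2):
    """Keep genes whose total count across all samples is at least min_total.

    With min_total=2 this drops rows that have no counts or only a
    single count across all samples.
    """
    return {gene: row for gene, row in counts.items() if sum(row) >= min_total}


# Toy gene -> counts-per-sample table (hypothetical values).
toy_counts = {
    "gene1": [0, 3, 1, 0],   # total 4: kept
    "gene2": [0, 0, 0, 0],   # total 0: dropped
    "gene3": [0, 1, 0, 0],   # total 1: dropped
    "gene4": [1, 1, 0, 0],   # total 2: kept
}

filtered = drop_near_empty_rows(toy_counts)
print(sorted(filtered))
```

This is a much weaker filter than a per-group CPM cutoff; it only removes rows that carry essentially no information.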

But I naively believe (as I'm not an expert on RNA-seq data) that the rationale for a pre-defined cutoff for low-count genes is a bit arbitrary and depends heavily on your experimental design, whereas filtering methodologies in earlier steps, such as filtering out low-quality reads, are more "defined".

Hope that helps. Best,


EMBL European Molecular Biology Laborat…

This manuscript discusses how to weight hypotheses (weighting is a generalization of filtering) in a data-driven way. It is accompanied by the IHW package.


Can you write a protocol on how to use IHW after limma, to improve the accuracy of multiple testing?

