Question: Suggestions for non-specific filtering criteria with limma
rvinisko (United States) wrote, 2.6 years ago:

Hello, I have a set of RNA-seq data from a clinical trial with 2 treatment groups (active, n ~ 90 patients; placebo, n ~ 35 patients), and each patient has 2 RNA samples (pre- and post-treatment). I plan to use limma with the voom transformation to detect differentially expressed genes (DEGs) between the two groups. Can someone provide guidance or suggestions on non-specific filtering criteria for removing lowly expressed genes, and the rationale behind a chosen cutoff? My understanding is that the data need to be normalized and that voom expects the lowly expressed genes to be removed before the weights are calculated. I've tried a few available tools and suggested methods, but would like help choosing a methodology for cutoff selection for this design.

Thanks, Richard

Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia) wrote, 2.6 years ago:

You are right that voom expects very low count genes to have been filtered.

There are a number of ways you can do this, and most of them work satisfactorily. You could choose genes with a minimum total count, or you could run aveLogCPM in the edgeR package and filter on that. Alternatively, here is a slightly more complex cpm method that I often use:

Your study has four groups of samples (active-pre, active-post, placebo-pre and placebo-post). Let's say a count of 5 is a minimum reasonable size. A gene would surely have to have a count of at least 5 in at least some samples before you would consider it interesting. More precisely, you would surely only be interested in genes with a count of 5 or more in at least half the samples of one of the groups. The smallest group in your study has 35 samples, and half of that, rounded up, is 18. So I would keep genes that achieve cpm > 5/M in at least 18 samples, where M is the median library size in millions.
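As a rough illustration of the arithmetic behind this rule, here is a minimal sketch in plain Python rather than R; it is not limma or edgeR code. The function name `cpm_filter`, its parameters and the toy counts are all hypothetical, and library size is assumed to be the simple per-sample total, with no normalization factors applied.

```python
import math
from statistics import median


def cpm_filter(counts, min_count=5, min_samples=18):
    """Flag genes whose CPM exceeds min_count / M in at least min_samples samples.

    counts: dict mapping gene -> list of raw counts, one entry per sample.
    M is the median library size in millions, as described above.
    (Hypothetical helper for illustration only.)
    """
    rows = list(counts.values())
    n_samples = len(rows[0])
    # Library size = total count per sample (assumption: no normalization factors).
    lib_sizes = [sum(row[j] for row in rows) for j in range(n_samples)]
    m = median(lib_sizes) / 1e6
    cutoff = min_count / m  # CPM equivalent of a raw count of min_count
    keep = {}
    for gene, row in counts.items():
        cpms = [c / (lib / 1e6) for c, lib in zip(row, lib_sizes)]
        keep[gene] = sum(v > cutoff for v in cpms) >= min_samples
    return keep, cutoff


# Half of the smallest group (35 samples), rounded up.
min_samples_for_study = math.ceil(35 / 2)

# Toy data: 3 genes x 4 samples, so a smaller min_samples is used here.
toy = {
    "high": [999_988, 1_999_976, 999_997, 1_999_999],
    "mid": [10, 20, 0, 0],
    "low": [2, 4, 3, 1],
}
keep, cutoff = cpm_filter(toy, min_count=5, min_samples=2)
print(min_samples_for_study, keep, round(cutoff, 2))
```

In a real analysis the same logic would normally be applied to the count matrix in R, for example with the cpm function in edgeR, before calling voom.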

svlachavas (Greece/Athens/National Hellenic Research Foundation) wrote, 2.6 years ago:

Dear Richard,

Since your technology is RNA-seq, you could start with the limma User's Guide, page 119, which describes a kind of filtering that can be performed on the number of "total counts".

Also, you can check the online course material from the Bioconductor repository, and especially the Pre-filtering section, which states:

"we can remove the rows that have no or nearly no information about the amount of gene expression. Here we remove rows of the DESeqDataSet that have no counts, or only a single count across all samples:.."

But I naively believe (as I'm not an expert on RNA-seq data) that the rationale for a pre-defined cutoff on low-count genes is a bit arbitrary and depends heavily on your experimental design, whereas filtering methods in earlier steps, such as filtering on low-quality reads, are more "defined".
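For comparison, the pre-filtering rule quoted above (drop rows with no counts, or only a single count across all samples) amounts to keeping genes whose total count is greater than 1. A minimal sketch in plain Python, with a hypothetical helper name and made-up toy counts, not DESeq2 code:

```python
def drop_near_empty_rows(counts):
    """Keep genes whose total count across all samples is greater than 1.

    counts: dict mapping gene -> list of raw counts, one entry per sample.
    (Hypothetical helper for illustration only.)
    """
    return {gene: row for gene, row in counts.items() if sum(row) > 1}


toy = {
    "geneA": [0, 0, 0, 0],  # no counts: dropped
    "geneB": [0, 1, 0, 0],  # a single count: dropped
    "geneC": [3, 0, 2, 7],  # kept
}
filtered = drop_near_empty_rows(toy)
print(sorted(filtered))
```

This is a much weaker filter than a cpm-based cutoff, since it only removes rows that carry essentially no information.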

Hope that helps. Best,


Wolfgang Huber (EMBL European Molecular Biology Laboratory) wrote, 2.6 years ago:

This manuscript discusses how to weight hypotheses (weighting is a generalization of filtering) in a data-driven way. It is accompanied by the IHW package.


Could you write a protocol on how to use IHW after limma, to improve the accuracy of multiple testing?

(Reply written 11 months ago by fubeide)