Question: Suggestions for non-specific filtering criteria with limma
rvinisko (United States) wrote, 22 months ago:

Hello, I have a set of RNA-seq data from a clinical trial with 2 treatment groups (active, n~90 patients; placebo, n~35 patients), and each patient has 2 RNA samples (pre- and post-treatment). I plan to use limma with the voom transformation to detect differentially expressed genes (DEG) between the two groups. Can someone provide guidance or suggestions on non-specific filtering criteria for removing lowly expressed genes, and the rationale behind a chosen cutoff? My understanding is that the data need to be normalized and that voom expects the lowly expressed genes to be removed before the weights are calculated. I've tried a few available tools and suggested methods, but would like help choosing a methodology for selecting the cutoff(s) for this design.

Thanks, Richard

Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia) wrote, 22 months ago:

You are right that voom expects very low count genes to have been filtered.

There are a number of ways you can do this and most of them work satisfactorily. You could choose genes with a minimum total count or you could run aveLogCPM in the edgeR package and filter on that. Alternatively, here is a slightly more complex cpm method that I often use:

Your study has four groups of samples (active-pre, active-post, placebo-pre and placebo-post). Let's say a count of 5 is a minimum reasonable size. A gene would surely have to have a count of at least 5 in at least some samples before you would consider it interesting. More precisely, you would surely only be interested in genes with a count of 5 or more in at least half the samples of one of the groups. The smallest group in your study has 35 samples, and half of that, rounded up, is 18. So I would keep genes that achieve cpm > 5/M in at least 18 samples, where M is the median library size in millions. (A count of 5 in a library of median size corresponds to a cpm of 5/M.)
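The arithmetic of this rule can be sketched in a few lines. The thread itself contains no code, so the following is only an illustration in plain Python: the function name `cpm_filter`, its parameters and the toy count matrix are all hypothetical, and in an actual limma/edgeR analysis the CPM values would normally come from edgeR's `cpm()` function in R rather than being computed by hand.

```python
from statistics import median


def cpm_filter(counts, min_count=5, min_samples=18):
    """Return the row indices of genes with cpm > min_count / M in at
    least min_samples samples, where M is the median library size in
    millions.

    counts: list of rows (one per gene), each a list of raw counts
            (one per sample). Hypothetical helper, for illustration only.
    """
    n_samples = len(counts[0])
    # Library size of a sample = sum of its raw counts over all genes.
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    # M: median library size, expressed in millions of reads.
    m = median(lib_sizes) / 1e6
    # CPM value that corresponds to min_count reads in a median-sized library.
    cutoff = min_count / m
    keep = []
    for i, row in enumerate(counts):
        cpms = [row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
        if sum(c > cutoff for c in cpms) >= min_samples:
            keep.append(i)
    return keep


# Toy data: 3 genes x 4 samples, every library has exactly 1000 reads.
toy = [
    [10, 12, 0, 1],          # above the cutoff in 2 samples
    [0, 1, 0, 0],            # never above the cutoff
    [990, 987, 1000, 999],   # above the cutoff in all samples
]
kept = cpm_filter(toy, min_count=5, min_samples=2)
```

For the design discussed here, the defaults `min_count=5` and `min_samples=18` match the numbers in the answer; the toy call uses `min_samples=2` only because it has four samples.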

svlachavas (Greece/Athens/National Hellenic Research Foundation) wrote, 22 months ago:

Dear Richard,

Since, as you mentioned, your technology is RNA-seq, you could start with page 119 of the limma user's guide, which describes a kind of filtering that can be performed on the number of "total counts".

Also, you can check the online course material from the Bioconductor repository:

and especially

in the Pre-filtering section, it states:

"we can remove the rows that have no or nearly no information about the amount of gene expression. Here we remove rows of the DESeqDataSet that have no counts, or only a single count across all samples:.."
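As a rough illustration of the pre-filter described in that quote, the sketch below drops genes whose total count across all samples is 0 or 1. It is plain Python rather than the DESeqDataSet code from the course material, and the function name `drop_near_empty_rows` and the toy matrix are hypothetical.

```python
def drop_near_empty_rows(counts, min_total=2):
    """Return the row indices of genes whose total count over all
    samples is at least min_total (the default of 2 removes rows with
    no counts or only a single count). Hypothetical helper."""
    return [i for i, row in enumerate(counts) if sum(row) >= min_total]


# Toy data: 4 genes x 3 samples.
toy_counts = [
    [0, 0, 0],   # no counts: removed
    [0, 1, 0],   # a single count: removed
    [1, 1, 0],   # two counts in total: kept
    [7, 3, 9],   # kept
]
kept_rows = drop_near_empty_rows(toy_counts)
```

This is a much weaker filter than a cpm-based rule: it only removes rows that carry essentially no information.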

But I naively believe (as I'm not an expert on RNA-seq data) that a pre-defined cutoff for low-count genes is somewhat arbitrary and depends heavily on your experimental design, whereas filtering in earlier steps, such as filtering out low-quality reads, is more "defined".

Hope that helps. Best,


Wolfgang Huber (EMBL European Molecular Biology Laboratory) wrote, 22 months ago:

This manuscript discusses how to weight hypotheses (weighting is a generalization of filtering) in a data-driven way. It is accompanied by the IHW package.


Could you write a protocol on how to use IHW after limma to improve the accuracy of multiple testing?

Reply written 10 weeks ago by fubeide