Question: Suggestions for non-specific filtering criteria with limma
2.3 years ago by
United States
rvinisko10 wrote:

Hello, I have a set of RNA-seq data from a clinical trial with 2 treatment groups (active, n~90 pts; placebo, n~35 pts), and each patient has 2 RNA samples (pre- and post-treatment). I plan to use limma with the voom transformation to detect differentially expressed genes (DEG) between the two groups. Can someone provide guidance or suggestions on non-specific filtering criteria for removing lowly expressed genes, and the rationale behind a chosen cutoff? My understanding is that the data need to be normalized and that voom expects the lowly expressed genes to be removed before the weights are calculated. I've tried a few available tools and suggested methods, but would like help choosing a methodology for cutoff selection for this design.

thanks, richard

2.3 years ago by
Gordon Smyth33k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth33k wrote:

You are right that voom expects very low count genes to have been filtered.

There are a number of ways you can do this and most of them work satisfactorily. You could choose genes with a minimum total count or you could run aveLogCPM in the edgeR package and filter on that. Alternatively, here is a slightly more complex cpm method that I often use:

Your study has four groups of samples (active-pre, active-post, placebo-pre and placebo-post). Let's say a count of 5 is a minimum reasonable size. A gene would surely have to have a count of at least 5 in at least some samples before you would consider it interesting. More precisely, you would surely only be interested in genes with a count of 5 or more in at least half the samples of one of the groups. The smallest group in your study has 35 samples, and half of that, rounded up, is 18. So I would keep genes that achieve cpm > 5/M in at least 18 samples, where M is the median library size in millions.
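To make the arithmetic of this rule concrete, here is a minimal sketch in plain Python. It is an illustration only, not the edgeR implementation: the function name `cpm_filter`, its parameters and the toy counts are hypothetical, and cpm is computed from raw column sums without normalization factors.

```python
from statistics import median

def cpm_filter(counts, min_count=5, min_samples=18):
    """Return genes with cpm > min_count/M in at least min_samples samples.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    M is the median library size in millions. Library sizes are assumed
    to be positive. (Hypothetical helper, for illustration only.)
    """
    genes = list(counts)
    n_samples = len(counts[genes[0]])
    # Library size of each sample = sum of its raw counts over all genes.
    lib_sizes = [sum(counts[g][j] for g in genes) for j in range(n_samples)]
    m = median(lib_sizes) / 1e6      # median library size in millions
    cutoff = min_count / m           # cpm threshold, i.e. 5/M by default
    keep = []
    for g in genes:
        cpms = [counts[g][j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
        # Count the samples in which this gene exceeds the cpm threshold.
        if sum(c > cutoff for c in cpms) >= min_samples:
            keep.append(g)
    return keep

# Toy data: 3 genes, 4 samples (far smaller than a real library).
toy = {
    "geneA": [100, 120, 90, 110],
    "geneB": [0, 1, 0, 2],
    "geneC": [50, 0, 60, 0],
}
kept = cpm_filter(toy, min_count=5, min_samples=2)
```

For the design in this thread, the defaults correspond to the values above: a minimum count of 5 and 18 samples, half of the smallest group of 35 rounded up.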

2.3 years ago by
Greece/Athens/National Hellenic Research Foundation
svlachavas570 wrote:

Dear Richard,

Since your technology is RNA-seq, you could start with the limma User's Guide, page 119, which describes a kind of filtering that can be performed on the number of "total counts".

Also, you can check the online course material from the Bioconductor repository, especially the Pre-filtering section, which states:

"we can remove the rows that have no or nearly no information about the amount of gene expression. Here we remove rows of the DESeqDataSet that have no counts, or only a single count across all samples:.."
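As a rough illustration of the rule in that quote (drop rows with no counts, or only a single count across all samples), here is a minimal plain-Python sketch. The quoted material operates on a DESeqDataSet in R; the function name `drop_near_empty_rows` and the toy data below are hypothetical.

```python
def drop_near_empty_rows(counts, min_total=2):
    """Keep genes whose total count across all samples is at least min_total.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    With min_total=2, rows with zero counts or a single count are removed.
    (Hypothetical helper, for illustration only.)
    """
    return {gene: row for gene, row in counts.items() if sum(row) >= min_total}

# Toy data: one all-zero gene, one gene with a single count, two expressed genes.
example = {
    "gene1": [0, 0, 0, 0],
    "gene2": [0, 1, 0, 0],
    "gene3": [1, 1, 0, 0],
    "gene4": [12, 7, 30, 5],
}
filtered = drop_near_empty_rows(example)
```

This is a much weaker filter than a cpm-based rule: it only removes genes that carry essentially no information.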

But I naively believe (as I'm not an expert on RNA-seq data) that the rationale for any pre-defined cutoff on low-count genes is a bit arbitrary and depends heavily on your experimental design, whereas filtering in earlier steps, such as filtering on low-quality reads, is more "defined".

Hope that helps. Best,


2.2 years ago by
EMBL European Molecular Biology Laboratory
Wolfgang Huber13k wrote:

This manuscript discusses how to weight hypotheses (weighting is a generalization of filtering) in a data-driven way. It is accompanied by the IHW package.


Can you write a protocol on how to use IHW after limma to improve the accuracy of multiple testing?

(comment written 7 months ago by fubeide)