Question

Suggestions for non-specific filtering criteria with limma

rvinisko ▴ 10

@rvinisko-8178

Last seen 8.9 years ago

United States

Hello, I have a set of RNA-seq data from a clinical trial with 2 treatment groups (active, n~90 patients; placebo, n~35 patients), and each patient has 2 RNA samples (pre- and post-treatment). I plan to use limma with the voom transformation to detect differentially expressed genes (DEGs) between the two groups. Can someone provide guidance or suggestions on non-specific filtering criteria for removing low-expressed genes, and the rationale behind a chosen cutoff? My understanding is that the data need to be normalized and that voom expects the low-expressed genes to be removed before the weight calculations. I've tried a few available tools and suggested methods but would like help choosing a methodology for cutoff selection for this design.

thanks, richard

limma genefilter • 2.4k views

updated 8.9 years ago by Wolfgang Huber ★ 13k • written 9.0 years ago by rvinisko ▴ 10

score 3 · Answer 1 · 2015-12-13

You are right that voom expects very low count genes to have been filtered.

There are a number of ways you can do this and most of them work satisfactorily. You could choose genes with a minimum total count or you could run aveLogCPM in the edgeR package and filter on that. Alternatively, here is a slightly more complex cpm method that I often use:

Your study has four groups of samples (active-pre, active-post, placebo-pre and placebo-post). Let's say a count of 5 is a minimum reasonable size. A gene would surely have to have a count of at least 5 in at least some samples before you would consider it interesting. More precisely, you would surely only be interested in genes with a count of 5 or more in at least half the samples of one of the groups. The minimum number of samples in a group in your study is 35; half of that, rounded up, is 18. So I would keep genes that achieve cpm > 5/M in at least 18 samples, where M is the median library size in millions.
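The rule above can be sketched in a few lines. This is only an illustration of the arithmetic, not code from the answer: the function name `filter_by_cpm`, its arguments and the toy data are hypothetical, and cpm is computed from raw library sizes without any normalization factors. In an actual limma/voom analysis the same filter would normally be applied in R, for example with edgeR's `cpm()`.

```python
import math
from statistics import median

# Half of the smallest group (35 samples), rounded up, as in the text above.
MIN_SAMPLES = math.ceil(35 / 2)


def filter_by_cpm(counts, min_count=5, min_samples=MIN_SAMPLES):
    """Return the genes with cpm > min_count / M in at least min_samples samples.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    M is the median library size in millions.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size of a sample = its total count over all genes.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    cutoff = min_count / (median(lib_sizes) / 1e6)
    kept = []
    for gene, row in counts.items():
        cpm = [c / lib * 1e6 for c, lib in zip(row, lib_sizes)]
        if sum(v > cutoff for v in cpm) >= min_samples:
            kept.append(gene)
    return kept


# Toy data (hypothetical): 4 samples with library sizes of about one million.
toy = {
    "filler": [1_000_000] * 4,  # stands in for all other genes
    "geneB": [10, 10, 0, 0],    # about 10 cpm in two samples
    "geneC": [1, 1, 1, 1],      # about 1 cpm everywhere
}
kept_genes = filter_by_cpm(toy, min_count=5, min_samples=2)
```

With library sizes near one million the cutoff is close to 5 cpm, so `geneB` passes in two samples and `geneC` in none. Expressing the cutoff in cpm rather than raw counts keeps the filter comparable across samples with different sequencing depths.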

Gordon Smyth · Answer 2 · 2015-12-11

Dear Richard,

As you mentioned that your technology is RNA-seq, you could start by checking the limma User's Guide

https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf on page 119,

which describes a kind of filtering that can be performed on the number of "total counts".

Also, you can check the online course material from the bioconductor repository:

http://www.bioconductor.org/help/course-materials/

and especially http://www.bioconductor.org/help/workflows/rnaseqGene/

in the Pre-filtering section, it states:

"we can remove the rows that have no or nearly no information about the amount of gene expression. Here we remove rows of the DESeqDataSet that have no counts, or only a single count across all samples:.."
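As a rough illustration of what that quoted pre-filter does (dropping rows with zero or one count in total), here is a minimal sketch. The function name `drop_near_empty_rows` and the toy table are hypothetical, and this is not the workflow's own code, which operates on a DESeqDataSet in R.

```python
def drop_near_empty_rows(counts, min_total=2):
    """Keep genes whose total count across all samples is at least min_total.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    With min_total=2, rows with no counts or a single count are removed.
    """
    return {gene: row for gene, row in counts.items() if sum(row) >= min_total}


# Toy count table (hypothetical values), 3 samples.
toy_counts = {
    "gene1": [0, 0, 0],   # no counts: removed
    "gene2": [0, 1, 0],   # a single count: removed
    "gene3": [1, 1, 0],   # two counts in total: kept
    "gene4": [50, 60, 70],
}
filtered = drop_near_empty_rows(toy_counts)
```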

But I naively believe (as I'm not an expert on RNA-seq data) that the rationale for a pre-defined cutoff on low-count genes is a bit arbitrary and depends heavily on your experimental design, whereas filtering in earlier steps, such as filtering out low-quality reads, is more "defined".

Hope that helps. Best,

Efstathios

score 0 · Answer 3 · 2015-12-24

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 12 weeks ago

EMBL European Molecular Biology Laborat…

This manuscript http://dx.doi.org/10.1101/034330 discusses how to weight hypotheses (weighting is a generalization of filtering) in a data-driven way. It is accompanied by the IHW package.

8.9 years ago • Wolfgang Huber ★ 13k

Can you write a protocol on how to use IHW after limma, to improve the accuracy of multiple testing?

7.3 years ago • fubeide • 0