Suggestions for non-specific filtering criteria with limma
rvinisko ▴ 10
@rvinisko-8178
Last seen 8.3 years ago
United States

Hello, I have a set of RNA-seq data from a clinical trial with 2 treatment groups (active, n~90 pts; placebo, n~35 pts), and each patient has 2 RNA samples (pre- and post-treatment). I plan to use limma with the voom transformation to detect DEGs between the two groups. Can someone provide guidance or suggestions on non-specific filtering criteria for removing lowly expressed genes, and the rationale behind a chosen cutoff? My understanding is that the data need to be normalized and that voom expects the lowly expressed genes to be removed before the weights are calculated. I've tried a few available tools and suggested methods, but would like help choosing a methodology for cutoff selection for this design.

Thanks, Richard

limma genefilter • 2.1k views
@gordon-smyth
Last seen 42 minutes ago
WEHI, Melbourne, Australia

You are right that voom expects very low count genes to have been filtered.

There are a number of ways you can do this and most of them work satisfactorily. You could choose genes with a minimum total count or you could run aveLogCPM in the edgeR package and filter on that. Alternatively, here is a slightly more complex cpm method that I often use:

Your study has four groups of samples (active-pre, active-post, placebo-pre and placebo-post). Let's say a count of 5 is a minimum reasonable size. A gene would surely have to have a count of at least 5 in at least some samples before you would consider it interesting. More precisely, you would surely only be interested in genes with a count of 5 or more in at least half the samples of one of the groups. The smallest group in your study has 35 samples, and half of that, rounded up, is 18. So I would keep genes that achieve cpm > 5/M in at least 18 samples, where M is the median library size in millions; a count of 5 in a library of M million reads corresponds to a cpm of 5/M.
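To make the arithmetic of this rule concrete, here is a minimal sketch in plain Python. It is only an illustration, not edgeR code: the function name `cpm_filter` and the toy count matrix are made up, and cpm is computed from raw column sums, whereas edgeR's own cpm calculation may also apply normalization factors.

```python
from statistics import median


def cpm_filter(counts, min_count=5, min_samples=18):
    """Return one True/False flag per gene (hypothetical helper).

    counts: list of genes, each a list of raw counts per sample.
    A gene is kept if its cpm exceeds min_count / M in at least
    min_samples samples, where M is the median library size in millions.
    """
    n_samples = len(counts[0])
    # Library size of a sample = sum of its counts over all genes.
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    median_lib_millions = median(lib_sizes) / 1e6
    cutoff = min_count / median_lib_millions  # the "5/M" threshold in cpm

    keep = []
    for row in counts:
        cpms = [c * 1e6 / lib for c, lib in zip(row, lib_sizes)]
        n_pass = sum(1 for v in cpms if v > cutoff)
        keep.append(n_pass >= min_samples)
    return keep


# Toy data: 3 genes x 4 samples, every library is exactly 1 million reads,
# so M = 1 and the cutoff is 5 cpm. min_samples=2 stands in for the 18
# samples suggested for the real study.
toy_counts = [
    [999990, 999980, 999999, 999996],  # highly expressed gene
    [10, 20, 0, 0],                    # above the cutoff in 2 samples
    [0, 0, 1, 4],                      # never above the cutoff
]
keep = cpm_filter(toy_counts, min_count=5, min_samples=2)
```

For the study described in the question, the call would use `min_count=5` and `min_samples=18` on the full count matrix.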

svlachavas ▴ 830
@svlachavas-7225
Last seen 6 months ago
Germany/Heidelberg/German Cancer Resear…

Dear Richard,

Since you mention that your technology is RNA-seq, you could start with the limma user's guide

https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf on page 119,

which describes a kind of filtering that can be performed on the number of "total counts".

Also, you can check the online course material from the bioconductor repository: 

http://www.bioconductor.org/help/course-materials/

and especially http://www.bioconductor.org/help/workflows/rnaseqGene/

in the Pre-filtering section, it states:

"we can remove the rows that have no or nearly no information about the amount of gene expression. Here we remove rows of the DESeqDataSet that have no counts, or only a single count across all samples:.."
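As a rough illustration of the pre-filtering described in that quote, the sketch below keeps only rows whose total count across all samples is at least 2. It is plain Python rather than the workflow's own R code, and the function name `prefilter_by_total` and the example matrix are hypothetical.

```python
def prefilter_by_total(counts, min_total=2):
    """Keep genes whose summed count over all samples is >= min_total.

    counts: dict mapping gene name -> list of raw counts per sample.
    With min_total=2 this drops genes with zero counts or a single
    count across all samples.
    """
    return {gene: row for gene, row in counts.items() if sum(row) >= min_total}


example_counts = {
    "geneA": [0, 0, 0, 0],    # no counts: removed
    "geneB": [0, 1, 0, 0],    # a single count in total: removed
    "geneC": [1, 1, 0, 0],    # total of 2: kept
    "geneD": [12, 30, 7, 9],  # clearly expressed: kept
}
filtered = prefilter_by_total(example_counts)
```

This is a much more lenient filter than a cpm-based rule, so it mainly reduces the size of the data rather than removing all low-count genes.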

But I naively believe (as I'm not an expert on RNA-seq data) that any pre-defined cutoff for low-count genes is somewhat arbitrary and depends heavily on your experimental design, whereas filtering in earlier steps, such as filtering out low-quality reads, is more "defined".

Hope that helps. Best,

Efstathios

@wolfgang-huber-3550
Last seen 7 days ago
EMBL European Molecular Biology Laborat…

This manuscript http://dx.doi.org/10.1101/034330 discusses how to weight hypotheses (weighting is a generalization of filtering) in a data-driven way. It is accompanied by the IHW package.
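One way to see why weighting generalizes filtering is through the weighted Benjamini-Hochberg procedure, sketched below in plain Python. This is not the IHW package's algorithm, which additionally learns the weights from a covariate; here the weights are simply given, and the function name `weighted_bh` and the p-values are invented for illustration. Giving filtered-out hypotheses weight 0 and spreading the weight evenly over the rest reproduces "filter, then apply BH to what is left".

```python
def weighted_bh(pvalues, weights, alpha=0.05):
    """Weighted Benjamini-Hochberg: return a reject flag per hypothesis.

    weights must be non-negative and average to 1. Each p-value is
    divided by its weight, then the usual BH step-up rule is applied.
    """
    m = len(pvalues)
    if len(weights) != m:
        raise ValueError("pvalues and weights must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - m) > 1e-9 * m:
        raise ValueError("weights must be non-negative and average to 1")

    # A zero weight means the hypothesis can never be rejected.
    q = [p / w if w > 0 else float("inf") for p, w in zip(pvalues, weights)]
    order = sorted(range(m), key=lambda i: q[i])

    # Step-up rule: find the largest rank k with q_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if q[i] <= rank * alpha / m:
            k_max = rank

    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


pvals = [0.001, 0.012, 0.03, 0.04, 0.2, 0.5]

# Uniform weights give ordinary BH over all six hypotheses.
plain = weighted_bh(pvals, [1.0] * 6)

# "Filtering" the last three hypotheses: weight 0 for them, and
# weight 6/3 = 2 for each of the three that are kept.
as_filter = weighted_bh(pvals, [2.0, 2.0, 2.0, 0.0, 0.0, 0.0])
```

In this toy example the third hypothesis is rejected only when the weights concentrate on the first three, which is the same result BH would give on those three p-values alone.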


Could you write up a protocol for using IHW after limma, to improve the accuracy of multiple testing?
