Question

Filtering lowly expressed genes in voom-limma analysis

0

rubi ▴ 110

@rubi-6462

Last seen 5.7 years ago

Hi,

This is a pretty well-discussed subject; nevertheless, every time I analyze a new kind of RNA-seq data I hit the issue of how to filter lowly expressed genes in a differential expression analysis.

My data are read counts of microRNAs, which have a somewhat lower expression range than mRNAs.

I have 4 experimental conditions (4 genotypes) with 3 samples each, and I'm using limma for the differential expression analysis.

If I follow the limma guide and keep exons that have more than 1 cpm in at least 3 samples, I lose quite a lot of microRNAs. Some of them are real signal, since these 3 samples may all be from the same genotype that is down-regulated.

Perhaps a more sensible filtering approach is to set to zero all samples of a certain experimental condition for which 3 or more samples have cpm <= 1. The problem here is that the cutoff is arbitrary, so genes that were slightly below the cutoff in one condition (and hence set to 0) but slightly above it in another (and hence left as they are) will be false positives.

So my question is: is there a happy medium?

voom limma rna-seq differential gene expression • 4.4k views

updated 7.8 years ago by Gordon Smyth 50k • written 7.8 years ago by rubi ▴ 110

score 4 · Answer 1 · 2016-07-28

There is no recommendation in the limma guide that you use cpm>1. You are supposed to adjust the cpm cutoff according to the library sizes of your samples, and I don't think you have done that. Have a read of the filtering section in this article for more guidance:

http://bioinf.wehi.edu.au/edgeR/F1000Research2016/edgeRQL.pdf

The only reason why cpm>1 was used in the Pasilla case study in the limma User's Guide was because the smallest library sizes were a bit over 10 million, so cpm=1 corresponded to counts of a little over 10.
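As a rough illustration of that arithmetic, here is a minimal Python sketch (plain arithmetic, not limma or edgeR code). The function name `cpm_cutoff` and both sets of library sizes are made up for the example; the target of about 10 reads is taken from the Pasilla reasoning above.

```python
def cpm_cutoff(min_count, lib_sizes):
    """Return the CPM value that equals `min_count` reads in the smallest library."""
    smallest = min(lib_sizes)
    return min_count / smallest * 1e6


# Hypothetical library sizes (total reads per sample), for illustration only.
large_libs = [10_200_000, 11_500_000, 12_100_000]  # smallest is a bit over 10 million
small_libs = [1_900_000, 2_000_000, 2_400_000]     # e.g. a shallower small-RNA run

print(cpm_cutoff(10, large_libs))  # just under 1 cpm
print(cpm_cutoff(10, small_libs))  # a little over 5 cpm
```

Under these assumed numbers, the same 10-read target maps to very different cpm cutoffs, which is why a fixed cpm>1 rule does not transfer between datasets.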

I don't understand the concern of your post, because filtering cannot be determined by down-regulated genes. Also, it is not correct to base the filtering on "all samples of a certain experimental condition". The filtering has to be independent of the experimental design.
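To make "independent of the experimental design" concrete, here is a small sketch in plain Python (not limma or edgeR code). The function names, the toy counts, the 2-million-read libraries and the cutoff of 5 cpm are all assumptions for illustration. The rule only counts how many samples exceed the cutoff and never looks at genotype labels.

```python
def cpm(row, lib_sizes):
    """Counts per million for one feature across all samples."""
    return [count / size * 1e6 for count, size in zip(row, lib_sizes)]


def keep_feature(row, lib_sizes, cutoff, min_samples):
    """True if CPM exceeds `cutoff` in at least `min_samples` samples.

    No group labels are passed in, so the decision cannot depend on the design.
    """
    n_above = sum(value > cutoff for value in cpm(row, lib_sizes))
    return n_above >= min_samples


# Toy data: 12 samples (4 genotypes x 3 replicates), all hypothetical.
lib_sizes = [2_000_000] * 12
counts = {
    "one_genotype_only": [40, 35, 50] + [0] * 9,
    "low_everywhere": [2, 1, 0, 3, 1, 2, 0, 1, 2, 1, 0, 2],
    "expressed_everywhere": [120, 90, 150, 110, 95, 130, 140, 100, 125, 115, 105, 135],
}

# min_samples = 3 matches the "at least 3 samples" rule from the question.
kept = {name: keep_feature(row, lib_sizes, cutoff=5.0, min_samples=3)
        for name, row in counts.items()}
print(kept)
```

In this toy example the feature expressed in only one genotype is retained, because any 3 samples above the cutoff are enough, while the feature that is low in every sample is dropped.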

score 3 · Answer 2 · 2016-07-28

3

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

You can filter at 0.5 cpm or 0.25 cpm ... or lower. The thing you have to watch out for is that the red fit line in the "voom plot" doesn't whip back down on the left side of the x axis.

7.8 years ago Steve Lianoglou ★ 13k

0

Interesting, could you explain in more detail why this would be a problem?

7.8 years ago maltethodberg ▴ 180

3

The decrease in the variance at low abundances is due to the discreteness of the counts near zero, which limits the possible variability of the log-transformed expression values and compromises the accuracy of the linear model. The decrease also messes up the loess fit by making the trend more complicated, as it is no longer monotonically decreasing with abundance; and it interferes with estimation of the prior degrees of freedom, as discreteness results in variance estimates that are much more precise than expected (simply because the estimates are constrained by small count sizes, so they can't vary much).

7.8 years ago Aaron Lun ★ 28k
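The discreteness effect described in the reply above can be sketched numerically. The Python snippet below is a simplified illustration, not voom itself. It assumes purely Poisson counts (no biological variation) and a fixed library size, so the log-cpm differs from log2(count + 0.5) only by a constant. The 0.5 offset mirrors the one voom adds before taking logs, and the function name `sd_log2_count` is made up for the example.

```python
import math


def sd_log2_count(mu, offset=0.5):
    """Exact SD of log2(Y + offset) for Y ~ Poisson(mu), summed over the pmf."""
    upper = int(mu + 20 * math.sqrt(mu) + 50)  # far enough into the upper tail
    p = math.exp(-mu)                          # P(Y = 0)
    mean = 0.0
    mean_sq = 0.0
    for k in range(upper + 1):
        x = math.log2(k + offset)
        mean += p * x
        mean_sq += p * x * x
        p *= mu / (k + 1)                      # P(Y = k + 1) from P(Y = k)
    return math.sqrt(max(mean_sq - mean * mean, 0.0))


for mu in (0.05, 0.2, 1.0, 5.0, 20.0, 100.0):
    print(f"mean count {mu:>6}: sd of log2(count + 0.5) = {sd_log2_count(mu):.3f}")
```

Under these assumptions the standard deviation rises as the mean count falls from 100 towards about 1, then drops again for means well below 1, where almost every count is 0 or 1. That drop is the downturn at the left of the voom plot mentioned earlier in the thread.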