Question: Filtering lowly expressed genes in voom-limma analysis
14 months ago by
rubi70 wrote:


This is a pretty well discussed subject, nevertheless every time I analyze some new form of RNA-seq data I hit the issue of how to filter lowly expressed genes in a differential expression analysis.


My data are read counts of micro-RNAs, which have somewhat of a lower expression range than mRNA.

I have 4 experimental conditions (4 genotypes), with 3 samples for each one, and I'm using limma for the differential expression analysis.

If I follow the limma guide and keep exons that have more than 1 cpm in at least 3 samples, I lose quite a lot of microRNAs. Some of them are real signal, since these 3 samples may all be from the same genotype that is down-regulated.

Perhaps a more sensible filtering approach is to set to zero all samples of a certain experimental condition for which 3 or more samples have cpm <= 1. The problem here is that the cutoff is arbitrary and therefore genes which in one condition were a bit below the cutoff and hence set to 0, but in another condition were a bit above it and hence left as they are, will be false positives.


So my question is: is there a happy medium?

14 months ago by
Gordon Smyth31k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth31k wrote:

There is no recommendation in the limma guide that you use cpm>1. You are supposed to adjust the cpm cutoff according to the library sizes in your sample, and I don't think you have done that. Have a read of the filtering section in this article for more guidance:

The only reason why cpm>1 was used in the Pasilla case study in the limma User's Guide was because the smallest library sizes were a bit over 10 million, so cpm=1 corresponded to counts of a little over 10.
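A minimal sketch of that arithmetic, assuming the goal is a CPM cutoff that corresponds to a chosen minimum read count in the smallest library. The function name `cpm_cutoff` and the library sizes are hypothetical, and the target of 10 counts is taken from the Pasilla example above rather than being a fixed rule:

```python
def cpm_cutoff(min_count, library_sizes):
    """CPM value equivalent to `min_count` reads in the smallest library.

    CPM is count / library_size * 1e6, so the cutoff scales inversely
    with library size.
    """
    smallest = min(library_sizes)
    return min_count / smallest * 1e6


# Hypothetical library sizes, a bit over 10 million reads each:
# the cutoff comes out close to 1 cpm.
print(cpm_cutoff(10, [10_200_000, 11_500_000, 12_300_000]))

# Hypothetical smaller libraries of about 2 million reads:
# the same 10 reads now correspond to roughly 5 cpm.
print(cpm_cutoff(10, [2_000_000, 2_400_000, 3_100_000]))
```

With small libraries, which are common for micro-RNA data, a fixed cpm>1 rule would therefore admit features supported by only a handful of reads.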

I don't understand the concern of your post, because filtering cannot be determined by down-regulated genes. Also, it is not correct to base the filtering on "all samples of a certain experimental condition". The filtering has to be independent of the experimental design.
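One way to picture a design-independent filter is the sketch below. It is only an illustration in plain Python, not limma or edgeR code: the function `keep_expressed`, the counts, and the library sizes are all hypothetical. A gene is kept if its CPM exceeds the cutoff in at least `min_samples` samples, and the group labels are never consulted:

```python
def keep_expressed(counts, library_sizes, cpm_cutoff, min_samples):
    """Return one boolean per gene.

    counts: list of rows (one per gene), each a list of per-sample counts.
    A gene is kept if its CPM exceeds cpm_cutoff in at least min_samples
    samples, regardless of which experimental group those samples are in.
    """
    keep = []
    for row in counts:
        n_above = sum(
            1
            for count, lib in zip(row, library_sizes)
            if count / lib * 1e6 > cpm_cutoff
        )
        keep.append(n_above >= min_samples)
    return keep


# Hypothetical data: 12 samples (4 genotypes x 3 replicates),
# each with a library size of 1 million reads.
library_sizes = [1_000_000] * 12
counts = [
    [50, 60, 55, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # expressed in 3 samples
    [12, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0],   # above cutoff in only 2 samples
    [20, 0, 0, 30, 0, 0, 25, 0, 0, 0, 0, 0],  # 3 samples, spread across groups
]
print(keep_expressed(counts, library_sizes, cpm_cutoff=10, min_samples=3))
```

Here `min_samples=3` mirrors the three replicates per genotype described in the question. The third gene passes even though its three expressing samples sit in different genotypes, because the rule only counts samples.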


14 months ago by
Steve Lianoglou12k wrote:

You can filter at 0.5 cpm or 0.25 cpm ... or lower. The thing you have to watch out for is that the red fit line in the "voom plot" doesn't whip back down on the left side of the x axis.



Interesting, could you explain in more detail why this would be a problem?

written 14 months ago by maltethodberg40

The decrease in the variance at low abundances is due to the discreteness of the counts near zero, which limits the possible variability of the log-transformed expression values and compromises the accuracy of the linear model. The decrease also messes up the loess fit by making the trend more complicated, as it is no longer monotonically decreasing with abundance; and it interferes with estimation of the prior degrees of freedom, as discreteness results in variance estimates that are much more precise than expected (simply because the estimates are constrained by small count sizes, so they can't vary much).

modified 14 months ago • written 14 months ago by Aaron Lun16k
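The effect of discreteness on the variance trend can be illustrated with a small calculation. The sketch below makes a strong simplifying assumption: counts are pure Poisson with no biological variation, and the transformed value is log2(count + 0.5) with no library-size scaling. The helper `log_count_variance` is hypothetical and is not what voom computes; it only shows why the trend can turn back down near zero:

```python
import math


def log_count_variance(mu):
    """Exact variance of log2(y + 0.5) for y ~ Poisson(mu).

    The sum runs far enough into the upper tail that the omitted
    probability mass is negligible.
    """
    upper = int(mu + 10 * math.sqrt(mu) + 20)
    mean = 0.0
    mean_sq = 0.0
    for k in range(upper + 1):
        # Poisson probability computed on the log scale for stability
        p = math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
        x = math.log2(k + 0.5)
        mean += p * x
        mean_sq += p * x * x
    return mean_sq - mean ** 2


for mu in (0.05, 0.5, 2, 10, 100):
    var = log_count_variance(mu)
    print(f"mean count {mu:>6}: variance of log2(count + 0.5) = {var:.3f}")
```

Under these assumptions the variance falls as the mean count grows from moderate to large values, but it also shrinks as the mean approaches zero, because almost every observation is then the same value (a count of 0). That is the non-monotonic shape that shows up when the filter is too lenient.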