Question

Appropriate ways to filter counts data for voom/limma

1

Entering edit mode

blofeld ▴ 10

@blofeld-11309

Last seen 7.5 years ago

Bonn, Germany

Hi everybody,

I have Illumina RNA Seq data from human biopsies (2 groups, 50 samples each). My pipeline is trimmomatic-star-featureCounts-edgeR-voom/limma (big fan of limma, being a veteran from array days..).

I am looking into the optimal ways to filter my data to increase detection power. Thus far, I have used the method from the limma manual to filter by cpm in >50 samples. I have stumbled over discussions of alternative ways to filter, e.g. htsfilter, or filtering by variance. I recall, however, that certain ways to filter do not work well with limma.

What is your recommendation? Is it appropriate to try the alternative ways to filter here, or would you recommend me to stick with cpm-based filtering, or no filtering at all?

Many thanks!

limma edger sequencing • 3.5k views

ADD COMMENT • link updated 7.7 years ago by Gordon Smyth 50k • written 7.7 years ago by blofeld ▴ 10

score 6 · Answer 1 · 2016-08-17

I think it is always best to take a simple intuitive approach that is appropriate for your experiment.

You are lucky enough to have a large number of replicates in each group. On the other hand, the samples are human biopsies so are likely to be quite heterogeneous. Ask yourself how many samples a gene would have to be expressed in to be interesting to you. If a gene was expressed in 30 samples of one group and not in any samples of the other group, would you want to detect that gene? Would that be consistent enough? What about 25 samples of one and none of the other? Let's say you want at least 30.

Let L be the smallest library size in your data. If you keep genes that satisfy

keep <- rowSums( cpm(y) > 5/L*1e6 ) >= 30

then you will be keeping genes that have a count of at least 5 in at least 30 samples. That would probably perform well for your data.

I haven't tried the htsfilter package, but I tend to prefer simple methods. Variance filtering is IMO never appropriate for RNA-seq data.

Comment added two years later (Nov 2018)

edgeR now has a function filterByExpr() that you can try. It tries to make sensible decisions along the lines outlined above for you:

keep <- filterByExpr(y, design)

score 1 · Answer 2 · 2016-08-17

Generally, the only filtering required is to filter out low-count genes. You don't want to perform variance filtering, because this will mislead limma's variance modeling.

There are a couple of ways to perform low-count filtering. The simplest is to filter by average log CPM (as returned by the aveLogCPM function in edgeR). For samples that mostly have the same set of genes expressed, this is probably the best approach. If you have a more unusual design, you might need to instead filter on quantiles of the CPM values. I have found this necessary when there are many genes expressed at a high level in one group but completely absent in the other, as would be the case for cross-tissue comparisons. For instance, if comparing two tissues, requiring that the 25% percentile of the CPM values is above your threshold guarantees that the gene is present in at least half of the samples for both tissues. Alternatively, if such genes are of interest, you may prefer to leave them in and use edgeR or DESeq2, which should better be able to handle zero counts in one group as long as the other groups has some counts from which to estimate a dispersion.