As you've realized, filtering on raw counts can be problematic. For example, if you require at least n libraries to be above some raw count threshold, then you'll select for genes that are (possibly spuriously) upregulated in the group with larger libraries. Similarly, if you filter on the sum of counts across all libraries for each gene, you will favour genes that have higher expression in the larger libraries:
> set.seed(100) # for reproducibility
> a <- rnbinom(10000, mu=1000, size=20) # large library
> b <- rnbinom(10000, mu=100, size=20) # small library
> keep <- a + b > 1700 # row sum filter (arbitrary threshold)
> median(log(a/b/10)) # should be near zero, as no DE is present
> median(log(a/b/10)[keep]) # systematic bias towards the larger library
This is a consequence of the mean-variance relationship: the variance of a count increases with its mean, so the larger library dominates the variability of the count sum, and the rows that clear a sum threshold tend to be those where the count in the larger library happens to be randomly high. See our 2014 paper for some more simulations along these lines, albeit in the context of peak calling for ChIP-seq data.
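As a quick sanity check on the simulation above: rnbinom is parametrized so that the variance is mu + mu^2/size, which means the counts in the larger library are far more variable than those in the smaller one.
> var(a); 1000 + 1000^2/20 # empirical vs. theoretical NB variance, large library
> var(b); 100 + 100^2/20 # same for the small library; ~85-fold less variable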
All of the filtering examples in the edgeR user's guide involve CPMs, so differences in library sizes between samples or groups shouldn't pose a problem. In my own analyses, I like to filter on the average abundance (as computed by the aveLogCPM function), which also accounts for differences in library size between samples. If you stick to either of these strategies, you should be okay.
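For concreteness, here is a minimal sketch of both strategies, treating the simulated a and b above as a two-sample count matrix. The cutoffs (above 1 CPM in at least one sample, average log-CPM above zero) are arbitrary placeholders for illustration, not recommendations; tune them to your own data.
> library(edgeR)
> y <- DGEList(counts=cbind(large=a, small=b))
> keep.cpm <- rowSums(cpm(y) > 1) >= 1 # CPM-based filter, adjusts for library size
> keep.ave <- aveLogCPM(y) > 0 # average abundance filter across all samples
> summary(keep.cpm); summary(keep.ave)
Either logical vector can then be used to subset the DGEList (e.g., y <- y[keep.ave,]) before normalization and dispersion estimation.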