It is generally advised to filter low count genes before performing differential expression on RNA-seq data, and this should be done independently of the sample labels or metadata to avoid data snooping. But what if sample sequencing depth happens to be strongly correlated with groups? Does using a filter based on raw counts introduce bias in this case, since the group with more deeply-sequenced samples will have more low expression genes that pass the filter than the other group? What is the recommended procedure in this case?
As you've realized, filtering on raw counts can be problematic. For example, if you require at least n libraries to be above some raw count threshold, then you'll select for genes that are (possibly spuriously) upregulated in the group with larger libraries. Similarly, if you filter on the sum of counts across all libraries for each gene, you will favour genes that have higher expression in the larger libraries:
> a <- rnbinom(10000, mu=1000, size=20) # large library > b <- rnbinom(10000, mu=100, size=20) # small library > keep <- a + b > 1700 # row sum filter (arbitrary) > median(log(a/b/10)) # should be near zero, as no DE present  0.004938763 > median(log(a/b/10)[keep]) # systematic bias towards larger library  0.4951755
This is because of the mean-variance relationship, whereby you're more likely to get a large count sum if the counts are randomly greater in the larger libraries. Check out our 2014 paper for some more simulations, albeit in the context of peak calling for ChIP-seq data.
All examples of filtering in the
edgeR user's guide all involve the use of CPMs, so differences in library sizes between samples or groups shouldn't pose a problem. In my analyses, I like to filter on the average abundance (as computed with the
aveLogCPM function) which also accounts for differences in library sizes between samples. If you stick to either of these strategies, you should be okay.