It is pretty well established at this point that independent filtering of low-abundance genes, based on average CPM across all samples irrespective of which group they belong to, is a good way to increase one's statistical power. However, I'm wondering how one would implement this correctly when the treatment groups vary greatly in size. For example, imagine that group A has 20 samples while group B has only 5. In this case, the average CPM across all samples would be dominated by the mean in group A, so it seems like this would bias the results in favor of genes that are downregulated in group B, since low counts in the 5 group B samples would not significantly affect the overall mean CPM.

The obvious alternative would be to filter on a weighted average CPM where each sample is weighted by the inverse of its group size, so that the total weight of each group is equal. However, this is no longer blind to the experimental design, since the group labels were used to assign the weights.
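To make the concern concrete, here is a minimal sketch of the two candidate filter statistics in plain Python. The function names, the CPM values, and the 20-versus-5 layout are all made up for illustration; this is not how edgeR computes anything.

```python
def mean_cpm(cpm):
    """Unweighted average CPM across all samples (blind to group labels)."""
    return sum(cpm) / len(cpm)


def group_balanced_mean_cpm(cpm, groups):
    """Average CPM with each sample weighted by the inverse of its group size,
    so that every group contributes the same total weight.
    Note that this uses the group labels, so it is not blind to the design."""
    sizes = {}
    for g in groups:
        sizes[g] = sizes.get(g, 0) + 1
    n_groups = len(sizes)
    # Each sample gets weight 1 / (n_groups * size of its group); weights sum to 1.
    return sum(x / (n_groups * sizes[g]) for x, g in zip(cpm, groups))


# Hypothetical design: 20 samples in group A, 5 in group B.
groups = ["A"] * 20 + ["B"] * 5

# Two hypothetical genes with mirrored expression patterns.
down_in_b = [2.0] * 20 + [0.0] * 5  # expressed in A, absent in B
up_in_b = [0.0] * 20 + [2.0] * 5    # absent in A, expressed in B

for name, cpm in [("down_in_b", down_in_b), ("up_in_b", up_in_b)]:
    print(name, mean_cpm(cpm), group_balanced_mean_cpm(cpm, groups))
```

With these made-up numbers the unweighted mean is 1.6 for the gene that is down in B and 0.4 for the gene that is up in B, so a cutoff of 1 CPM would keep the first and drop the second. The group-balanced mean is about 1.0 for both.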

So is either of these methods correct? If so, which one? If not, what is the right way to do this?

Just to add, the average abundance (NB mean) that Aaron refers to is the AveLogCPM quantity that is automatically computed as part of the edgeR pipeline. It is the same quantity that appears in the logCPM column of the topTags output. It is an approximately independent statistic, relative to any contrast of the group means not involving the intercept, regardless of the size of each group.

In the limma-voom context, the equivalent quantity would be the weighted average of the logCPMs, using the voom weights.
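As a rough illustration of that weighted average, here is a minimal Python sketch for a single gene. The function name and the numbers are hypothetical; in practice the logCPM values and the precision weights would both come from the voom output in R, not from anything computed here.

```python
def weighted_mean_logcpm(logcpm, weights):
    """Weighted average of one gene's logCPM values.
    `weights` stands in for that gene's voom precision weights (assumed positive)."""
    if len(logcpm) != len(weights):
        raise ValueError("logcpm and weights must have the same length")
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, logcpm)) / total


# Hypothetical values for one gene across four samples.
logcpm = [1.0, 3.0, 2.0, 2.0]
weights = [1.0, 3.0, 2.0, 2.0]
print(weighted_mean_logcpm(logcpm, weights))
```

Samples with larger weights pull the average toward their own logCPM, which is the only difference from a plain mean.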