It is fairly well established at this point that independent filtering of low-abundance genes, based on average CPM across all samples irrespective of group membership, is a good way to increase statistical power. However, I'm wondering how to implement this correctly when the treatment groups differ greatly in size. For example, imagine that group A has 20 samples while group B has only 5. The average CPM across all samples would then be highly correlated with the group A mean, so filtering on it seems to bias the results in favor of genes that are downregulated in group B, since low counts in the 5 group B samples would barely affect the overall mean CPM. The obvious alternative would be to filter on a weighted average CPM, with each sample weighted by the inverse of its group size so that each group carries equal total weight. However, this is no longer blind to the experimental design, since the group labels are used to assign the weights.
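To make the two schemes concrete, here is a minimal NumPy sketch of both filters for the 20-vs-5 design above (the simulated counts, the variable names, and the 1 CPM threshold are all illustrative assumptions, not anyone's recommended defaults):

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_a, n_b = 1000, 20, 5  # unbalanced design from the question
counts = rng.poisson(lam=50, size=(n_genes, n_a + n_b))

lib_sizes = counts.sum(axis=0)   # per-sample library sizes
cpm = counts / lib_sizes * 1e6   # counts per million

# Scheme 1: plain mean CPM, dominated by the 20 group A samples
plain_mean = cpm.mean(axis=1)

# Scheme 2: weight each sample by the inverse of its group size,
# so each group contributes equally to the average
groups = np.array(["A"] * n_a + ["B"] * n_b)
weights = np.array([1.0 / (groups == g).sum() for g in groups])
weights /= weights.sum()
weighted_mean = cpm @ weights

# Arbitrary 1 CPM cutoff, purely for illustration
keep_plain = plain_mean > 1
keep_weighted = weighted_mean > 1
```

Note that in the second scheme each group A sample gets weight proportional to 1/20 and each group B sample 1/5, which is exactly where the design information leaks into the filter.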
So is either of these methods correct? If so, which one? If not, what is the right way to do this?