Independent filtering with unequal group sizes?
3
0
Entering edit mode
@ryan-c-thompson-5618
Last seen 23 months ago
Scripps Research, La Jolla, CA

It is pretty well-established at this point that independent filtering of low-abundance genes based on average CPM across all samples irrespective of which group they belong to is a good way to increase one's statistical power. However, I'm wondering how one would implement this correctly in the case where the treatment groups vary greatly in size. For example, imagine that group A has 20 samples while group B has only 5. In this case, then the average CPM across all samples would be highly correlated to the mean in group A, so it seems like this would bias the results in favor of genes that are downregulated in group B, since low counts in the 5 group B samples would not significantly affect the overall mean CPM. The obvious alternative would be to filter on a weighted average CPM where each sample is weighted by the inverse of its group size, so that the total weight of each group is equal. However, this is no no longer blind to the experimental design, since the group labels were used to assign the weights.

So is either of these methods correct? If so, which one? If not, what is the right way to do this?

independent filtering edger deseq2 • 1.3k views
5
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 4 hours ago
The city by the bay

Our simulations indicate that the average abundance (i.e., the overall NB mean, computed by fitting an intercept-only model to the counts) is independent of the p-value distribution for DE tests on NB-distributed counts. This seems to work regardless of the size of the groups, and it's the filtering strategy that I usually use for sequencing data prior to analyses with edgeR.

To answer your original question; imagine a situation where we have 99 samples in one group, and 1 sample in another group (assuming normally-distributed values for simplicity). Compute the overall unweighted mean across all groups. Define a large filter threshold for the overall mean. Genes are only retained if they have overall means above this threshold, which generally means that either:

1. the mean for the first group is high enough to drag the overall mean above the threshold. This is unlikely, as the sample mean is computed across 99 samples and has fairly low variance.
2. the mean for the second group is high enough to drag the overall mean above the threshold. This is also unlikely, as even though the variance is larger for one observation, the size of the observed value must be a lot larger to pull the other 99 samples along.

In short, and with a bit of handwaving, these two end up cancelling out, such that you don't get any systematic difference between groups in the set of genes after filtering. You can check this out:

> y <- matrix(rnorm(1000000), ncol=100)
> rwsm <- rowMeans(y)
> logFC <- rowMeans(y[,-1]) - y[,1]
> mean(logFC[rank(rwsm)/length(rwsm) > 0.95]) # Taking the top 5% of overall means
[1] -0.002398321 # Pretty close to zero

The independence of the sample mean for normal distributions can be proved more generally - Bourgon et al.'s 2010 paper (see Mike's link) has a proof in the supplementaries, I think.

1
Entering edit mode

Just to add, the average abundance (NB mean) that Aaron refers to is the AveLogCPM quantity that is automatically computed as part of the edgeR pipeline. It is the same quantity that appears on the logCPM column of the topTags output. It is an approximately independent statistic, relative to any contrast of the group means not involving the intercept, regardless of the size of each group.

In the limma-voom context, the equivalent quantity would be the weighted average of the logCPMs, using the voom weights.

2
Entering edit mode
@mikelove
Last seen 19 hours ago
United States

hi Ryan,

That's a good point and question, where you might want to take the weighted average or something like the average of the within-group averages.

The criteria for whether the independent filtering step will increase type I error is not whether it is blind to experimental design actually, but this:

• Is the filter statistic independent of the test statistic (e.g. p-value) under the null hypothesis? [1]

I believe that the average of the within-group averages would be independent, but you could do a null hypothesis simulation with t-tests and normal data.

0
Entering edit mode
@steve-lianoglou-2771
Last seen 14 hours ago
United States

I don't know how well established that filtering criterion you described is ... I mean, that's not how I do it :-)

I'd rather filter the data you describe by by taking only the genes that are expressed >= x cpm in >= 5 samples, where x is some minimal expression value.

At some point, I think this was the recommended workflow outlined in the voom part of the limma user's guide from some time ago, although it looks like the filtering strategies outlined in the limmaUsersGuide has since changed upon a reskim of the current release version.

Still, it seems to make sense to me, and I don't think I'm doing something too boneheaded, but I admit there's always the chance of my missing something obvious.

1
Entering edit mode

No, that's not boneheaded at all, and we regularly use this approach. We have recommended and still use the k cpm over x filtering approach in conjunction with older edgeR pipelines, because it filters out genes that are all zero except for one or two outliers. The k cpm over x is not quite an independent filter, because it tends to select genes with larger overall variances, but it nevertheless has some good properties. It is usually slightly conservative.

In the context of limma-voom, or edgeR QL, or edgeR estimateDisp, you could consider filtering using an overall abundance measure instead and handle outlier genes using the robust options of those pipelines. In the edgeR glm pipeline, you could do the same with estimateGLMRobustDisp().

0
Entering edit mode

Well, you can also filter by a minimum cpm cutoff in a required number of samples: you're just filtering based on a specified quantile rather than the mean. But the same issue exists: a gene that is abundant in group A and not B is more likely to make the cutoff in the required number of samples than a gene that is equally abundant in group B but absent in A.