Question: Independent filtering with unequal group sizes?
Ryan C. Thompson (The Scripps Research Institute, La Jolla, CA) wrote:

It is pretty well established at this point that independent filtering of low-abundance genes, based on average CPM across all samples irrespective of which group they belong to, is a good way to increase one's statistical power. However, I'm wondering how to implement this correctly when the treatment groups vary greatly in size. For example, imagine that group A has 20 samples while group B has only 5. The average CPM across all samples would then be highly correlated with the mean in group A, so it seems like this would bias the results in favor of genes that are downregulated in group B, since low counts in the 5 group B samples would not substantially affect the overall mean CPM. The obvious alternative would be to filter on a weighted average CPM, with each sample weighted by the inverse of its group size, so that the total weight of each group is equal. However, this is no longer blind to the experimental design, since the group labels were used to assign the weights.

So is either of these methods correct? If so, which one? If not, what is the right way to do this?
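To make the two candidate filter statistics concrete, here is a base-R sketch with the group sizes from the example (the count matrix and the CPM threshold are made up purely for illustration):

```r
set.seed(1)
# Toy counts: 1000 genes x 25 samples (20 in group A, 5 in group B)
counts <- matrix(rnbinom(1000 * 25, mu = 50, size = 10), nrow = 1000)
group <- rep(c("A", "B"), c(20, 5))

# Counts per million for each sample
cpm <- t(t(counts) / colSums(counts)) * 1e6

# Filter statistic 1: plain average CPM across all samples
avg_cpm <- rowMeans(cpm)

# Filter statistic 2: weighted average, each sample weighted by the
# inverse of its group size so that both groups contribute equally
w <- 1 / table(group)[group]
w <- as.numeric(w / sum(w))
wavg_cpm <- as.numeric(cpm %*% w)

# Both can then be thresholded, e.g. keep genes above 5 CPM
keep_unweighted <- avg_cpm >= 5
keep_weighted <- wavg_cpm >= 5
```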

Aaron Lun (Cambridge, United Kingdom) wrote:

Our simulations indicate that the average abundance (i.e., the overall NB mean, computed by fitting an intercept-only model to the counts) is independent of the p-value distribution for DE tests on NB-distributed counts. This seems to work regardless of the size of the groups, and it's the filtering strategy that I usually use for sequencing data prior to analyses with edgeR.
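In edgeR this statistic is computed by aveLogCPM(). As a rough base-R approximation of the idea (my own sketch; edgeR handles the prior count and library-size scaling more carefully), the average abundance is the log2 of the mean library-size-adjusted count:

```r
set.seed(2)
# Simulated counts: 500 genes x 10 samples
counts <- matrix(rnbinom(500 * 10, mu = 20, size = 5), nrow = 500)
lib_size <- colSums(counts)

# Approximate average abundance: mean CPM across all samples on the
# log2 scale, with a small offset to avoid log(0)
ave_log_cpm <- log2(rowMeans(t(t(counts) / lib_size) * 1e6) + 0.5)

# Filter on this one overall statistic; threshold chosen arbitrarily
keep <- ave_log_cpm > 1
```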

To answer your original question: imagine a situation where we have 99 samples in one group and 1 sample in another (assuming normally distributed values for simplicity). Compute the overall unweighted mean across all samples. Define a large filter threshold for the overall mean. Genes are retained only if their overall means are above this threshold, which generally means that either:

  1. the mean for the first group is high enough to drag the overall mean above the threshold. This is unlikely, as the sample mean is computed across 99 samples and has fairly low variance.
  2. the mean for the second group is high enough to drag the overall mean above the threshold. This is also unlikely, as even though the variance is larger for one observation, the size of the observed value must be a lot larger to pull the other 99 samples along.

In short, and with a bit of handwaving, these two end up cancelling out, such that you don't get any systematic difference between groups in the set of genes after filtering. You can check this out:

> y <- matrix(rnorm(1000000), ncol=100)
> rwsm <- rowMeans(y)
> logFC <- rowMeans(y[,-1]) - y[,1]
> mean(logFC[rank(rwsm)/length(rwsm) > 0.95]) # Taking the top 5% of overall means
[1] -0.002398321 # Pretty close to zero
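As a contrast (my own addition, not part of the original answer), repeating the check with a filter based only on the single "group 2" sample shows what a non-independent filter looks like: the retained genes have a clearly negative mean logFC.

```r
set.seed(3)
# Same setup: 10000 genes, 99 samples in one group, 1 in the other
y <- matrix(rnorm(1000000), ncol = 100)
logFC <- rowMeans(y[, -1]) - y[, 1]

# Filtering on the lone sample keeps genes where that one sample
# happens to be high, so logFC among the keepers is biased downwards
biased_stat <- y[, 1]
bias <- mean(logFC[rank(biased_stat) / length(biased_stat) > 0.95])
bias  # strongly negative, unlike the near-zero result above
```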

The independence of the sample mean can be proved more generally for normal distributions - Bourgon et al.'s 2010 paper (see Mike's link) has a proof in the supplementary materials, I think.


Just to add, the average abundance (NB mean) that Aaron refers to is the AveLogCPM quantity that is automatically computed as part of the edgeR pipeline. It is the same quantity that appears in the logCPM column of the topTags output. It is an approximately independent statistic, relative to any contrast of the group means not involving the intercept, regardless of the size of each group.

In the limma-voom context, the equivalent quantity would be the weighted average of the logCPMs, using the voom weights.

-- Gordon Smyth
Michael Love (United States) wrote:

hi Ryan,

That's a good point and question; you might want to take the weighted average, or something like the average of the within-group averages.

The criterion for whether the independent filtering step will increase the type I error is actually not whether it is blind to the experimental design, but this:

  • Is the filter statistic independent of the test statistic (e.g. p-value) under the null hypothesis? [1]

I believe that the average of the within-group averages would be independent, but you could check with a null-hypothesis simulation using t-tests and normal data.
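That check can be run directly: simulate null normal data with unequal group sizes, filter on the average of the within-group averages, and confirm that the t-test type I error stays at the nominal level (group sizes and thresholds below are made up for illustration):

```r
set.seed(4)
n_genes <- 5000
nA <- 20; nB <- 5
# Null data: no true group differences, but genes get different
# baseline means so the filter has something to act on
base <- rnorm(n_genes, mean = 5)
x <- matrix(rnorm(n_genes * (nA + nB)) + base, nrow = n_genes)

# Filter statistic: average of the two within-group averages
stat <- (rowMeans(x[, 1:nA]) + rowMeans(x[, (nA + 1):(nA + nB)])) / 2
keep <- stat > quantile(stat, 0.5)  # keep the top half

# Two-sample t-tests on the retained genes
pvals <- apply(x[keep, ], 1, function(g)
  t.test(g[1:nA], g[(nA + 1):(nA + nB)])$p.value)

typeI <- mean(pvals < 0.05)
typeI  # should stay close to the nominal 0.05 if the filter is independent
```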


Steve Lianoglou wrote:

I don't know how well established that filtering criterion you described is ... I mean, that's not how I do it :-)

I'd rather filter the data you describe by taking only the genes that are expressed at >= x cpm in >= 5 samples, where x is some minimal expression value.

At some point, I think this was the recommended workflow outlined in the voom section of the limma user's guide, although on a reskim of the current release it looks like the filtering strategies outlined in the limmaUsersGuide have since changed.

Still, it seems to make sense to me, and I don't think I'm doing something too boneheaded, but I admit there's always the chance of my missing something obvious.
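The "at least x cpm in at least 5 samples" rule above can be sketched in base R (the count matrix, x, and the sample cutoff here are arbitrary illustrations):

```r
set.seed(5)
# Simulated counts: 2000 genes x 25 samples
counts <- matrix(rnbinom(2000 * 25, mu = 10, size = 2), nrow = 2000)
cpm <- t(t(counts) / colSums(counts)) * 1e6

x_cpm <- 1      # minimal expression value, in CPM
k_samples <- 5  # typically the smallest group size

# Keep genes expressed at >= x CPM in at least k samples
keep <- rowSums(cpm >= x_cpm) >= k_samples
filtered <- counts[keep, , drop = FALSE]
```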


No, that's not boneheaded at all, and we regularly use this approach. We have recommended, and still use, the "at least x cpm in at least k samples" filtering approach in conjunction with older edgeR pipelines, because it filters out genes that are all zero except for one or two outliers. It is not quite an independent filter, because it tends to select genes with larger overall variances, but it nevertheless has some good properties. It is usually slightly conservative.

In the context of limma-voom, or edgeR QL, or edgeR estimateDisp, you could consider filtering using an overall abundance measure instead and handle outlier genes using the robust options of those pipelines. In the edgeR glm pipeline, you could do the same with estimateGLMRobustDisp().

-- Gordon Smyth
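The point that this filter favours higher-variance genes can be seen in a quick simulation (my own illustration, not from the thread): among genes with identical means, the more dispersed ones are more likely to exceed the cutoff in the required number of samples.

```r
set.seed(6)
n_genes <- 4000
# Every gene has the same mean, but two NB dispersion levels
# (smaller size parameter = larger dispersion)
disp <- rep(c(10, 0.5), each = n_genes / 2)
counts <- matrix(rnbinom(n_genes * 10, mu = 5, size = disp), nrow = n_genes)
cpm <- t(t(counts) / colSums(counts)) * 1e6

# Quantile-style filter: at least 3 samples above twice the mean CPM
keep <- rowSums(cpm >= 2 * mean(cpm)) >= 3

# High-dispersion genes pass noticeably more often
pass_rate <- tapply(keep, rep(c("low", "high"), each = n_genes / 2), mean)
pass_rate
```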

Well, you can also filter by a minimum cpm cutoff in a required number of samples: you're just filtering based on a specified quantile rather than the mean. But the same issue exists: a gene that is abundant in group A and not B is more likely to make the cutoff in the required number of samples than a gene that is equally abundant in group B but absent in A.

-- Ryan C. Thompson