2.8 years ago by Aaron Lun (Cambridge, United Kingdom)
I think you're worrying too much about this. With any filtering strategy, you'll find boundary cases where a gene is only just filtered out. As long as the CPM threshold is low, these boundary cases should only occur for low-abundance genes (which are the ones we're trying to remove in the first place), so it shouldn't matter much whether they're left in or out. The key is to remove the bulk of low-abundance genes, to avoid funny-looking trends in the NB (or QL) dispersions caused by strange GLM behaviour at low, discrete counts. If you do that, you'll be fine.
In my analyses, I prefer to use an average log-CPM threshold, i.e., removing genes whose aveLogCPM value falls below a certain minimum (usually around 0 or 1, depending on the sequencing depth). This is blind to the experimental design, which means I don't have to change my pipelines for different designs. It also has some nice statistical properties: the average abundance is roughly independent of the p-value, whereas the "at least X samples above a CPM cutoff" strategy is not. This ensures that filtering doesn't bias the DE statistics, e.g., by preferentially retaining genes that are more likely to be false positives.
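For concreteness, a minimal sketch of this kind of filter in edgeR (the counts matrix `counts` and the threshold of 1 are placeholders for illustration; pick the cutoff based on your sequencing depth):

```r
library(edgeR)

# 'counts' is a hypothetical genes-by-samples matrix of read counts.
y <- DGEList(counts=counts)

# Average log2-CPM for each gene across all samples;
# note that this does not look at the experimental design.
ab <- aveLogCPM(y)

# Keep genes above a minimum average abundance.
keep <- ab > 1
y <- y[keep, , keep.lib.sizes=FALSE]
```

Setting `keep.lib.sizes=FALSE` recomputes the library sizes from the retained genes, which is usually what you want after filtering.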
In your case, as long as a subset of samples expresses the gene, it has a chance to be retained by the average log-CPM filter. Of course, if fewer samples express the gene, each of them will need stronger expression to pass the average threshold. Note that this strategy has a tendency to include outliers when your data are noisy, as one or two samples with strong outlier expression will bump up the average. I usually rely on the robustness algorithms in glmQLFit and the like to protect against this in the final results.
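A sketch of the robust QL pipeline referred to above, assuming a filtered DGEList `y` and a design matrix `design` (the coefficient tested here is a placeholder):

```r
# Estimate NB dispersions given the design.
y <- estimateDisp(y, design)

# robust=TRUE robustifies the empirical Bayes moderation of the
# QL dispersions, so genes with outlier variability (e.g., driven
# by one or two aberrant samples) are not shrunk as aggressively.
fit <- glmQLFit(y, design, robust=TRUE)

# Test a coefficient of interest (placeholder choice of coef).
res <- glmQLFTest(fit, coef=2)
topTags(res)
```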
modified 2.8 years ago by Aaron Lun • 21k